Loss reduction in tf.distribute.MirroredStrategy() - tensorflow

I'm confused regarding using distributed strategy and the correct way of reduction in loss functions.
I implemented a U-Net using tf.distribute.MirroredStrategy(). Everything works fine using default loss BinaryCrossentropy as follows:
with strategy.scope():
    model = build_network((size, size, 3), num_classes)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=args.learning_rate),
                  loss=tf.keras.losses.BinaryCrossentropy())
However, I want to create custom loss functions. To start with, I wrote a wrapper containing BinaryCrossentropy, to get familiar with the correct way of using the reduction methods. I followed the instructions in https://www.tensorflow.org/tutorials/distribute/custom_training#define_the_loss_function
and used tf.nn.compute_average_loss in order to divide by the global batch_size.
def loss_functions(loss_spec):
    if loss_spec == 'cross_entropy':
        def c_loss(truth, pred):
            my_loss = tf.keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.NONE)(truth, pred)
            my_loss = tf.math.reduce_mean(my_loss, axis=[1, 2])  # to compute the average across the two image dimensions
            my_loss = tf.nn.compute_average_loss(my_loss)  # sums up all items and divides by the global batch size
            return my_loss
        return c_loss
which is called in the following way:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=args.learning_rate),
              loss=utils.loss_functions('cross_entropy'))
It also works, but I noticed a difference by a factor of the number of replicas compared to using tf.keras.losses.BinaryCrossentropy() directly. I.e., when running on two replicas, BinaryCrossentropy() yields a loss twice as large as my custom loss. Thus, to get the same value, I would need to divide by the batch size per replica instead of the global batch size, i.e., the way it should NOT be done according to the documentation.
However, the documentation refers to building an own training routine, whereas I am using model.compile() and model.fit() methods.
Can anybody explain this behaviour to me?
UPDATE:
The use of tf.nn.compute_average_loss, or of any reduction over the batch axis, is not needed at all when using model.compile() and model.fit() - the reduction and scaling are done automatically. However, I still do not know how model.fit() works internally.
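For reference, a minimal sketch of what the custom loss then reduces to (my own interpretation, not verified against the Keras internals): return per-sample losses averaged over the image dimensions only and leave the batch and replica scaling to model.fit():

def c_loss(truth, pred):
    # per-pixel BCE; binary_crossentropy already reduces over the channel axis
    per_pixel = tf.keras.losses.binary_crossentropy(truth, pred)
    # average over the two spatial dimensions only; model.fit() performs
    # the batch reduction and replica scaling automatically
    return tf.math.reduce_mean(per_pixel, axis=[1, 2])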
Thanks and cheers, everybody

Related

Is it possible to integrate Levenberg-Marquardt optimizer from Tensorflow Graphics with a Tensorflow 2.0 model?

I have a Tensorflow 2.0 tf.keras.Sequential model. Now, my technical specification prescribes using the Levenberg-Marquardt optimizer to fit the model. Tensorflow 2.0 doesn't provide it as an optimizer out of the box, but it is available in the Tensorflow Graphics module.
The tfg.math.optimizer.levenberg_marquardt.minimize function accepts residuals (a residual is a Python callable returning a tensor) and variables (a list of tensors corresponding to my model weights) as parameters.
What would be the best way to convert my model into residuals and variables?
If I understand correctly how the minimize function works, I have to provide two residuals. The first residual must call my model for every learning case and aggregate all the results into a tensor. The second residual must return all labels as a single constant tensor. The problem is that the tf.keras.Sequential.predict function returns a numpy array instead of a tensor. I believe that if I convert it to a tensor, the minimizer won't be able to calculate Jacobians with respect to the variables.
The same problem is with variables. It doesn't seem like there's a way to extract all weights from a model into a list of tensors.
There's a major difference between tfg.math.optimizer.levenberg_marquardt.minimize and Keras optimizers from the implementation/API perspective.
Keras optimizers, such as tf.keras.optimizers.Adam, consume gradients as input and update tf.Variables.
In contrast, tfg.math.optimizer.levenberg_marquardt.minimize essentially unrolls the optimization loop in graph mode (using a tf.while_loop construct). It takes initial parameter values and produces updated parameter values, unlike Adam & co, which only apply one iteration and actually change the values of tf.Variables via assign_add.
Stepping back a bit to the theoretical big picture, Levenberg-Marquardt is not a general gradient-descent-like solver for arbitrary nonlinear optimization problems (as Adam is). It specifically addresses nonlinear least-squares optimization, so it's not a drop-in replacement for optimizers like Adam. In gradient descent, we compute the gradient of the loss with respect to the parameters. In Levenberg-Marquardt, we compute the Jacobian of the residuals with respect to the parameters. Concretely, it repeatedly solves the linearized problem jacobian @ delta_params = residuals for delta_params using tf.linalg.lstsq (which internally uses a Cholesky decomposition on the Gram matrix computed from the Jacobian) and applies delta_params as the update.
Note that this lstsq operation has cubic complexity in the number of parameters, so in the case of neural nets it can only be applied to fairly small ones.
Also note that Levenberg-Marquardt is usually applied as a batch algorithm, not a minibatch algorithm like SGD, though there's nothing stopping you from applying the LM iteration on different minibatches in each iteration.
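To make the linearized solve concrete, here is a toy sketch of a single damped LM-style step on a tiny least-squares problem (my own illustration, not the tfg implementation); the l2_regularizer argument plays the role of the LM damping term:

import tensorflow as tf

# Fit y ≈ a * x + b by linearizing the residuals around the current params.
x = tf.constant([0.0, 1.0, 2.0, 3.0])
y = tf.constant([1.1, 2.9, 5.2, 6.8])
params = tf.Variable([0.0, 0.0])  # [a, b]

with tf.GradientTape() as tape:
    residuals = params[0] * x + params[1] - y      # shape (4,)
jacobian = tape.jacobian(residuals, params)         # shape (4, 2)

# Solve the linearized system jacobian @ delta = -residuals in the
# least-squares sense; the regularizer acts like the LM damping lambda.
delta = tf.linalg.lstsq(jacobian, -residuals[:, tf.newaxis], l2_regularizer=1e-3)
params.assign_add(tf.squeeze(delta, axis=1))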
I think you may only be able to get one iteration out of tfg's LM algorithm, with something like:
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize

for input_batch, target_batch in dataset:
    def residual_fn(trainable_params):
        # do not use trainable_params; it will still be at its initial value,
        # since we only do one iteration of Levenberg-Marquardt each time
        return model(input_batch) - target_batch
    new_objective_value, new_params = lm_minimize(residual_fn, model.trainable_variables, max_iter=1)
    for var, new_param in zip(model.trainable_variables, new_params):
        var.assign(new_param)
In contrast, I believe the following naive approach, where we assign the model parameters before computing the residuals, will not work:
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize

dataset_iterator = ...

def residual_fn(params):
    input_batch, target_batch = next(dataset_iterator)
    for var, param in zip(model.trainable_variables, params):
        var.assign(param)
    return model(input_batch) - target_batch

final_objective, final_params = lm_minimize(residual_fn, model.trainable_variables, max_iter=10000)
for var, final_param in zip(model.trainable_variables, final_params):
    var.assign(final_param)
The main conceptual problem is that residual_fn's output has no gradients with respect to its input params, since that dependency goes through a tf.assign. It might even fail before that, due to using constructs that are disallowed in graph mode.
Overall, I believe it's best to write your own LM optimizer that works on tf.Variables: tfg.math.optimizer.levenberg_marquardt.minimize has a very different API that is not really suited for optimizing Keras model parameters, since you can't directly compute model(input, parameters) - target_value without a tf.assign.
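As a starting point, here is a rough sketch of what such a self-written, per-step LM update on the model's tf.Variables could look like (my own sketch, assuming a small model with static variable shapes, not tested against tfg):

import tensorflow as tf

def lm_step(model, x_batch, y_batch, damping=1e-3):
    # One damped Gauss-Newton / LM-style update applied directly to the
    # model's tf.Variables. Only feasible for fairly small models, since
    # the full Jacobian is materialized.
    variables = model.trainable_variables
    with tf.GradientTape() as tape:
        residuals = tf.reshape(model(x_batch) - y_batch, [-1])   # shape (m,)
    jacobians = tape.jacobian(residuals, variables)              # list of (m, *var.shape)
    # Flatten each per-variable Jacobian to (m, n_i) and concatenate.
    m = tf.shape(residuals)[0]
    flat = [tf.reshape(j, [m, -1]) for j in jacobians]
    full_jacobian = tf.concat(flat, axis=1)                      # shape (m, n)
    delta = tf.linalg.lstsq(full_jacobian, -residuals[:, tf.newaxis],
                            l2_regularizer=damping)              # shape (n, 1)
    delta = tf.squeeze(delta, axis=1)
    # Split the update back into per-variable pieces and apply it.
    sizes = [v.shape.num_elements() for v in variables]
    for var, piece in zip(variables, tf.split(delta, sizes)):
        var.assign_add(tf.reshape(piece, var.shape))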

Tensorflow batch normalization: difference momentum and renorm_momentum

I want to replicate a network built with the Lasagne library in TensorFlow, and I'm having some trouble with the batch normalization.
This is the Lasagne documentation for the batch normalization being used:
http://lasagne.readthedocs.io/en/latest/modules/layers/normalization.html?highlight=batchNorm
In tensorflow I found two functions to normalize:
https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization
https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization
The first one is simpler, but does not let me choose the alpha parameter from Lasagne (the coefficient for the exponential moving average of batch-wise means and standard deviations computed during training). I tried using the second function, which has a lot more options, but there are two things I do not understand about it:
I am not clear about the difference between momentum and renorm_momentum. If I have an alpha of 0.9 in the Lasagne network, can I just set both TensorFlow momentums to 0.9 and expect the same behaviour?
The tf documentation notes:
when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)
I do not really understand what is happening here and where I need to put something similar in my code. Can I just put this somewhere before I run the session? What parts of this code piece should I not copy literally but change depending on my code?
There is a big difference between tf.nn.batch_normalization and tf.layers.batch_normalization. See my answer here. So you have made the right choice by using the layers version. Now, on to your questions:
renorm_momentum only has an effect if you use batch renormalization by setting the renorm argument to True. You can ignore it if you are using default batch normalization.
Short answer: You can literally copy that code snippet. Put it exactly where you would normally call optimizer.minimize.
Long answer on 2.: Batch normalization has two "modes": training and inference. During training, the mean and variance of the current minibatch are used. During inference, this is not desirable (e.g. you might not even use batches as input, so there would be no minibatch statistics). For this reason, moving averages of the minibatch means/variances are kept during training. These moving averages are then used for inference.
By default, TensorFlow only executes what it needs to. Those moving averages are not needed for training, so they would normally never be executed/updated. The tf.control_dependencies context manager forces TensorFlow to run the updates every time it computes whatever is in the code block (in this case the cost). Since the cost certainly needs to be computed exactly once per training step, this is a good way of making sure the moving averages are updated.
The code example seems a bit arcane, but in context it would really just be (as an example):
loss = ...
train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
    ....
becomes
loss = ...
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
    train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
    ....
Finally, keep in mind to pass the correct training argument to batch normalization, so that either minibatch statistics or moving averages are used as intended.
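For example, one common way to wire this up (a sketch with a hypothetical is_training placeholder and inputs tensor) is:

is_training = tf.placeholder(tf.bool, name='is_training')
h = tf.layers.batch_normalization(inputs, momentum=0.9, training=is_training)
# ...build the rest of the network and the train_op as above...
# feed_dict={is_training: True} during training, {is_training: False} at inference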

does TensorFlow automatically use sparse_softmax_cross_entropy_with_logits when possible?

Let's say that I have some code such as:
out = tf.nn.softmax(x) # shape (batch,time,n)
labels = .... # reference labels of type (batch,time)->int
And then I define my loss as the Cross Entropy:
loss = -tf.log(tf.gather_nd(out, labels))
Will TensorFlow automatically replace the loss in the computation graph by this?
loss = sparse_softmax_cross_entropy_with_logits(x, labels)
What type of optimizations can I expect that TensorFlow will apply?
Follow-up question: If TensorFlow doesn't do this optimization, how can I do it manually? Consider that I have a modular framework where I get some out tensor which could possibly be the output of a softmax operation, and I want to calculate Cross Entropy, and I want to use sparse_softmax_cross_entropy_with_logits if possible. How could I accomplish this? Can I do something like the following?
if out.op == "softmax":  # how to check this?
    x = out.op.sources[0]  # how to get this?
    loss = sparse_softmax_cross_entropy_with_logits(x, labels)
else:
    loss = -tf.log(tf.gather_nd(out, labels))
TensorFlow generally doesn't merge nodes together in the way you're hoping. This is because other code (e.g. fetching outputs when running) may depend on intermediate nodes like the softmax, so removing them behind the user's back would be confusing.
If you do want to do this optimization yourself as part of a higher-level framework, you can analyze the current graphdef, but there's no annotation in TF to tell you what the outputs are, since that can vary at runtime depending on how session.run is called.
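If you do go the manual route, a rough sketch of such a check (assuming TF1-style graphs, where a tensor exposes its producing op via .op, the op's type via .op.type, and the op's inputs via .op.inputs) might look like this; whether the gather fallback is numerically what you want is a separate question:

def cross_entropy_from_probs(out, labels):
    # If `out` is the direct output of a Softmax op, recover the logits it was
    # fed and use the fused, numerically stable op instead.
    if out.op.type == "Softmax":
        logits = out.op.inputs[0]
        return tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
    return -tf.log(tf.gather_nd(out, labels))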

Tracking counts of examples used in training

I am trying to implement the CBOW word2vec model based on the skipgrams implementation on the tensorflow repository:
https://github.com/tensorflow/tensorflow/blob/v0.10.0/tensorflow/models/embedding/word2vec.py
I have previously implemented the simplified version following the TensorFlow tutorials, so I understand that I will have to modify the data batching function as well as a small part of the graph to get the context embedding.
In the skipgram implementation, the data batching function is used in lines 348-351.
(words, counts, words_per_epoch, self._epoch, self._words, examples,
 labels) = word2vec.skipgram(filename=opts.train_data,
                             batch_size=opts.batch_size,
                             window_size=opts.window_size,
                             min_count=opts.min_count,
                             subsample=opts.subsample)
From my understanding, the variables assigned are as follows:
words: terms in the vocabulary
counts: associated counts of terms used in the corpus
words_per_epoch: total word count in the corpus
self._epoch: current count of epochs used
self._words: current count of training examples used
examples: current batch of training examples
labels: current batch of training labels
I have managed to replicate the tensors for words, counts, words_per_epoch, examples and labels. However, self._epoch and self._words have eluded me. If my understanding is correct, I need to be able to track the count of the training examples used. However, this is not provided by the sample batching function. The counts are later used in a multi-threaded manner to terminate the training loop, hence I can't simply use a loop to add up the counts.
I understand that bits of the tensorflow ops are implemented in C++. However, as I am not familiar with C++, I will have to replicate those parts using Python.
It would be great if I could get some suggestions on obtaining the tensor for self._words. The tensor basically has to be incremented every time a new batch of examples/labels is pulled. With that, I can simply use self._epoch = self._words // words_per_epoch to get the other tensor.
Figured out the trick while looking at the source code for tensorflow.models.embedding.word2vec_optimized.py. Specifically, how global_step was incremented when loss was called in lines 218-225.
In my case, I would have to do it like so:
# code to prepare the features and labels tensors
data_processed = tf.Variable(0, trainable=False, dtype=tf.int64)
epochs_processed = data_processed // data_per_epoch
inc_op = data_processed.assign_add(batch_size)
with tf.control_dependencies([inc_op]):
    features_batch, labels_batch = tf.train.batch([features, labels],
                                                  batch_size=batch_size)
In this case, the tensor data_processed will always be incremented by batch_size whenever features_batch or labels_batch is called. epochs_processed will also be incremented accordingly.
The use of tf.control_dependencies(control_inputs) is key here. It returns a context manager. The operations specified in control_inputs must be executed before the operations defined in the context.
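For completeness, a rough sketch (with hypothetical train_op and num_epochs names) of how these counters could then drive the training loop:

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    # each training step pulls a batch, which increments data_processed
    # via the control dependency above
    while sess.run(epochs_processed) < num_epochs:
        sess.run(train_op)
    coord.request_stop()
    coord.join(threads)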

Caching Computations in TensorFlow

Is there a canonical way to reuse computations from a previously-supplied placeholder in TensorFlow? My specific use case:
supply many inputs (using one placeholder) simultaneously, all of which are fed through a network to obtain smaller representations
define a loss based on various combinations of these smaller representations
train on one batch at a time, where each batch uses some subset of the inputs, without recomputing the smaller representations
Here is the goal in code, but it is defective because the same computations are carried out again and again:
X_in = some_fixed_data
combinations_in = large_set_of_combination_indices
for combination_batch_in in batches(combinations_in, batch_size=128):
    session.run(train_op, feed_dict={X: X_in, combinations: combination_batch_in})
Thanks.
The canonical way to share computed values across sess.run() calls is to use a Variable. In this case, you could set up your graph so that when the placeholders are fed, they compute a new value of the representation that is saved into a Variable. A separate portion of the graph reads those Variables to compute the loss. This will not work if you need to compute gradients through the part of the graph that computes the representation: computing those gradients would require recomputing every op in the encoder.
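A rough sketch of that setup (with hypothetical encoder_network and combination_loss functions and num_inputs/input_dim/rep_dim shapes) could look like:

X = tf.placeholder(tf.float32, [num_inputs, input_dim])
representations = tf.Variable(tf.zeros([num_inputs, rep_dim]), trainable=False)

encoded = encoder_network(X)                 # the expensive computation
cache_op = representations.assign(encoded)   # run once per new X_in

combinations = tf.placeholder(tf.int32, [None, 2])
selected = tf.gather(representations, combinations)  # cheap lookups per batch
loss = combination_loss(selected)            # assumed to contain its own trainable variables
train_op = tf.train.AdamOptimizer().minimize(loss)

# session.run(cache_op, feed_dict={X: X_in})                             # once
# session.run(train_op, feed_dict={combinations: combination_batch_in})  # per batch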
This is the kind of thing that should be solved automatically with CSE (common subexpression elimination). I'm not sure what the support in TensorFlow is right now; it might be kind of spotty, but there's an optimizer_do_cse flag in the graph options which defaults to false, and you can set it to true using GraphConstructorOptions. Here's a C++ example of using GraphConstructorOptions (sorry, couldn't find a Python one).
If that doesn't work, you could do "manual CSE", i.e., figure out which part is being needlessly recomputed, factor it out into a separate Tensor, and reference that tensor in all the calculations.
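In its simplest form (with hypothetical expensive_subgraph, head_a and head_b functions), manual CSE just means building the shared subgraph once and reusing the resulting tensor:

shared = expensive_subgraph(x)   # built once
loss_a = head_a(shared)          # both heads reference the same tensor,
loss_b = head_b(shared)          # so the shared part is computed only once per run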