TensorFlow optimizers: loss sum vs mean

I'm wondering if the Tensorflow optimizers (in particular the AdamOptimizer) have a preference when it comes to defining a loss function as a sum or as a mean/average over a minibatch?
In general my assumption was that using the mean is preferred, because then the loss does not depend on the size of the mini-batches. Thus, it is easier to find a learning rate which works with any batch size.
However, Tensorflow defines e.g. l2_loss internally as:
output = sum(t ** 2) / 2
Does this imply that the optimizers already account for the batch size internally, i.e., that they expect losses to scale linearly with the batch size? Also, what's the motivation for taking half of the squared L2 norm from the perspective of optimization?

Well, here l2_loss is actually a regularization loss function. We add it to our main loss function in order to prevent the parameters from overfitting. We normally divide the L2 loss by 2 in order to make the gradient simple: the derivative of sum(t ** 2) / 2 with respect to t is just t.
And inside any optimizer we take the average loss w.r.t. the batch size.
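As a small illustration of the sum-vs-mean point and of the 1/2 factor, here is a toy sketch (made-up values, TF 2.x eager mode assumed): with a summed loss the gradient scales with the batch size, with a mean loss it does not, and the gradient of l2_loss is just the tensor itself.

import tensorflow as tf

w = tf.Variable(1.0)
x = tf.constant([1.0, 2.0, 3.0, 4.0])   # a "batch" of 4 toy examples
y = tf.constant([2.0, 4.0, 6.0, 8.0])
t = tf.Variable([1.0, 2.0, 3.0, 4.0])

with tf.GradientTape(persistent=True) as tape:
    per_example = tf.square(w * x - y)
    loss_sum = tf.reduce_sum(per_example)    # scales linearly with batch size
    loss_mean = tf.reduce_mean(per_example)  # independent of batch size
    half_l2 = tf.nn.l2_loss(t)               # sum(t ** 2) / 2

print(tape.gradient(loss_sum, w).numpy())    # -60.0, 4x the mean-loss gradient
print(tape.gradient(loss_mean, w).numpy())   # -15.0
print(tape.gradient(half_l2, t).numpy())     # [1. 2. 3. 4.], i.e. just t itself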

Related

Multiple questions regarding the KL term in the ELBO loss with TensorFlow Probability

I have been trying to conduct a few experiments using TensorFlow Probability (TFP), and I got a few questions.
What is the proper value of the coefficient of the KL loss?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
As I go from 2D inputs to 3D image volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
What is the guideline for getting a proper value for this coefficient? Maybe like the two-loss terms should be the same order of magnitude?
The current coefficient only takes care of the number of training samples, but not the network complexity or the number of parameters in the network; I assume the KL loss increases with the complexity of the model.
I am trying to implement a neural network with the KL loss without using keras.model.losses, due to some software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0. The issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss, not from model.losses, but from the layers or weights of the network in a TF construct?
Is batch normalization or group normalization still helpful in Bayesian deep learning?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
In the BBB paper, eq. 8, M refers to the number of mini-batches. To be consistent with non-stochastic gradient learning, the KL should be scaled by the number of mini-batches, which is what Graves does. Another alternative is the one in eq. 9, where they scale it by π_i, with the values in the set {π_i} summing to one.
In the TFP example, it does look like num_examples is the total number of independent samples within the training set, which is much larger than the number of batches. This goes by a few names, such as Safe Bayes or tempering. Have a look at sec. 8 of this paper for some more discussion about the use of tempering within Bayesian inference and its suitability.
As I go from 2d input to 3d images volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
The ELBO will always be larger than just your cross-entropy (which defines your likelihood). Have a look at how the KL divergence term in the ELBO is derived (and at the full mean-field approach, where each weight/parameter is assumed to be independent).
Since the assumed posterior is factorised (each parameter is assumed to be independent), you can write the joint distribution as a product. This means that when you take the log while computing the KL between the approximate posterior and the prior, you can write it as a sum of KL terms, one per parameter. Since the KL is >= 0, every parameter you add to your model adds another non-negative term to your ELBO. This is likely why the loss is so much larger for your 3D model: it likely has many more parameters.
Another reason this could occur is if you have less data (your M is smaller, so the KL term is scaled down less and contributes relatively more).
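To make the "KL is a sum of per-parameter terms" point concrete, here is a small sketch (made-up sizes, not the poster's model; it assumes tfd.Independent and tfd.kl_divergence from TFP):

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

n_params = 1000  # made-up parameter count
posterior = tfd.Independent(
    tfd.Normal(loc=tf.random.normal([n_params]), scale=0.1),
    reinterpreted_batch_ndims=1)
prior = tfd.Independent(
    tfd.Normal(loc=tf.zeros([n_params]), scale=1.0),
    reinterpreted_batch_ndims=1)

kl_total = tfd.kl_divergence(posterior, prior)                                 # one scalar
kl_per_param = tfd.kl_divergence(posterior.distribution, prior.distribution)  # one term per parameter

# The total KL equals the sum of the non-negative per-parameter terms,
# so it grows as you add parameters.
print(float(kl_total), float(tf.reduce_sum(kl_per_param)))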
What is the guideline for getting a proper value for this coefficient? Maybe like the two-loss terms should be the same order of magnitude?
I am not aware of any specific guideline; for training you are primarily interested in the gradients. A large loss does not mean a large gradient. Have a look at the gradients contributed by the negative log-likelihood and by the KL term in your ELBO. If the KL term is too large, you probably need a more informative prior or more data (you could simply scale the KL term, but this feels a bit yucky for the Bayesian in me).
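One way to act on that advice is to look at the gradient norms of the two terms separately. A sketch, with hypothetical names (model, x_batch, y_batch, num_train_examples) and TF 2.x eager mode assumed:

import tensorflow as tf

# Hypothetical: `model` is a tfp.layers-based Keras model, `x_batch`/`y_batch` one
# mini-batch, `num_train_examples` the training-set size used to weight the KL.
with tf.GradientTape(persistent=True) as tape:
    logits = model(x_batch, training=True)
    nll = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_batch, logits=logits))
    kl = tf.add_n(model.losses) / num_train_examples  # KL terms accumulated by the layers

nll_grads = tape.gradient(nll, model.trainable_variables)
kl_grads = tape.gradient(kl, model.trainable_variables)
print(tf.linalg.global_norm([g for g in nll_grads if g is not None]).numpy(),
      tf.linalg.global_norm([g for g in kl_grads if g is not None]).numpy())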
The current coefficient only takes care of the number of training samples, but not the network complexity or the number of parameters in the network; I assume the KL loss increases with the complexity of the model.
Yes, as stated before, in general, more parameters == greater ELBO (for a mean-field approach as used in Bayes by Backprop).
I am trying to implement a neural network with the KL loss without using keras.model.losses, due to some software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0. The issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss, not from model.losses, but from the layers or weights of the network in a TF construct?
I am unsure about the best way to tackle this part of it. I would be cautious about going to older versions where it isn't explicitly supported. They put those warnings/exceptions in for a reason.
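If you do want to experiment, one possibility, offered only as a sketch (I have not verified it against TF 1.10 / TFP 0.3.0 specifically), is to keep references to the tfp.layers objects yourself and collect the divergence either from each layer's losses attribute or directly from its posterior/prior distributions:

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

# `inputs` is assumed to be an existing tensor/placeholder in your graph.
layer = tfp.layers.DenseFlipout(64)
hidden = layer(inputs)  # the layer must be called before it has losses/distributions

# Option 1: each layer accumulates its own KL term after being called.
kl_from_losses = tf.add_n(layer.losses)

# Option 2: compute the divergence explicitly from the layer's variational distributions.
kl_explicit = tfd.kl_divergence(layer.kernel_posterior, layer.kernel_prior)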
Is batch normalization or group normalization still helpful in Bayesian deep learning?
For variational inference (as done in Bayes by Backprop), batch norm is fine. For sampling methods such as MCMC, batch normalization is no longer suitable. Have a look at https://arxiv.org/pdf/1908.03491v1.pdf for info on the suitability of batch norm with sampling methods for approximate Bayesian inference.

An Efficient way to Calculate loss function batchwise?

I am using autoencoders to do anomaly detection. I have finished training my model, and now I want to calculate the reconstruction loss for each entry in the dataset, so that I can flag data points with high reconstruction loss as anomalies.
This is my current code to calculate the reconstruction loss:
But this is really slow. By my estimation, it should take 5 hours to go through the dataset, whereas training one epoch takes approx 55 mins.
I feel that the conversion to tensors is bottlenecking the code, but I can't find a better way to do it.
I've tried changing the batch sizes but it does not make much of a difference. I have to use the convert-to-tensor part because K.eval throws an error if I do it the normal way.
for i in range(0, encoded_dataset.shape[0], batch_size):
    y_true = tf.convert_to_tensor(encoded_dataset[i:i+batch_size].values, np.float32)
    y_pred = tf.convert_to_tensor(ae1.predict(encoded_dataset[i:i+batch_size].values), np.float32)
    # Append the batch losses (numpy array) to the list
    reconstruction_loss_transaction.append(K.eval(loss_function(y_true, y_pred)))
I was able to train at 55 mins per epoch, so I feel prediction should not take 5 hours per epoch. encoded_dataset is a variable that holds the entire dataset in main memory as a data frame.
I am using an Azure VM instance.
K.eval(loss_function(y_true, y_pred)) is there to find the loss for each row of the batch. y_true and y_pred will both be of shape (batch_size, 2000), and K.eval(loss_function(y_true, y_pred)) gives an output of shape (batch_size, 1), evaluating binary cross-entropy on each row of y_true and y_pred.
Moved from comments:
My suspicion is that ae1.predict and K.eval(loss_function) are behaving in unexpected ways. ae1.predict can be made to output the loss value as well as y_pred: when you create the model, specify the loss value as an additional output (you can have a list of multiple outputs), then just call predict once here to get both y_pred and the loss value in one call.
But I want the loss for each row. Won't the loss returned by the predict method be the mean loss for the entire batch?
The answer depends on how the loss function is implemented. Both ways produce perfectly valid and identical results in TF under the hood. You can either average the loss over the batch before taking the gradient w.r.t. the loss, or take the gradient w.r.t. a vector of losses. The gradient operation in TF will perform the averaging of the losses for you if you use the latter approach (see SO articles on taking the per-sample gradient; it's actually hard to do).
If Keras implements the loss with reduce_mean built into it, you can just define your own loss. If you're using square loss, replace 'mean_squared_error' with lambda y_true, y_pred: tf.square(y_pred - y_true). That produces the squared error instead of the MSE (no difference to the gradient), but look here for the variant including the mean.
In any case this produces a per-sample loss as long as you don't use tf.reduce_mean, which is purely optional in the loss. Another option is to simply compute the loss separately from what you optimize for and make it an output of the model; that is also perfectly valid.
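As a concrete illustration of that last point, the per-row reconstruction loss can also be computed without the Python loop at all. A sketch, assuming TF 2.x eager mode, that ae1 and encoded_dataset are as in the question, and that the reconstruction loss is binary cross-entropy:

import numpy as np
import tensorflow as tf

x = encoded_dataset.values.astype(np.float32)

# One predict call over the whole dataset; Keras handles the batching internally.
y_pred = ae1.predict(x, batch_size=1024)

# Per-row binary cross-entropy: the reduction is over the feature axis only,
# so the result has one loss value per row of the dataset.
per_row_loss = tf.keras.losses.binary_crossentropy(x, y_pred).numpy()
reconstruction_loss_transaction = per_row_loss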

How do I calculate subgradients in TensorFlow?

Does the automatic differentiation procedure in TensorFlow compute subgradient whenever needed? If there are many subgradients then which one will be chosen as output?
I am trying to implement the paper in the link https://www.aclweb.org/anthology/P13-1045, which uses recursive neural networks to perform efficient language parsing. The objective function uses a hinge loss to pick the optimal output vectors, which makes the objective non-differentiable. I used TensorFlow (v1.12) in eager mode to program the model and used automatic differentiation to compute the gradients. After every batch, I could see the gradient values changing, and the accuracy improved slightly. After a while, it decreases, and this process continues. The model does not converge at all for any of the hyper-parameter configurations.
Mini-batch size: 256, 512, 1024; regularization parameters: 0.1, 0.01, 0.001; learning rate: 0.1, 0.01, 0.001; optimization function: gradient descent, Adagrad, Adam.
In the paper, they describe how to find the subgradient for the optimum function in a very abstract manner, which I have not understood yet. I was of the opinion at the beginning that the automatic gradient computation calculates the subgradient. But at this moment, I am starting to doubt it, because that seems to be the only piece missing.
Unfortunately, TensorFlow does not compute subgradients, only gradients.
As explained here: How does tensorflow handle non differentiable nodes during gradient calculation?
To summarize, when computing a partial derivative, if there is a problem of differentiability, TensorFlow simply sets this derivative to zero.
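A tiny sketch of that behaviour (eager mode assumed): tf.abs is not differentiable at 0, and the gradient TF registers for it, sign(x), returns 0 there.

import tensorflow as tf

x = tf.Variable(0.0)
with tf.GradientTape() as tape:
    y = tf.abs(x)                      # not differentiable at x = 0
print(tape.gradient(y, x).numpy())     # 0.0 -- TF picks this particular value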
As for your trouble training the model, there are no general rules for how to tune the hyperparameters. Thus, I would suggest doing a grid search over the learning rates (on a few epochs) to find a good initial learning rate which provides good results for one of the optimization algorithms. Usually, Adam or SGD with momentum provide satisfying results when the right initial learning rate is chosen.

Different between fit and evaluate in keras

I have used 100000 samples to train a general model in Keras and achieved good performance. Then, for a particular sample, I want to use the trained weights as initialization and continue to optimize the weights to further reduce the loss on that particular sample.
However, a problem occurred. First, I load the trained weights easily via the Keras API; then I evaluate the loss on the one particular sample, and the loss is close to the validation loss seen during training of the model. I think that is normal. However, when I use the trained weights as the initialization and further optimize the weights on the one sample via model.fit(), the loss is really strange. It is much higher than the evaluate result and gradually becomes normal after several epochs.
I find it strange that, for the same single sample and the same loaded model weights, model.fit() and model.evaluate() return different results. I used batch normalization layers in my model, and I wonder whether that may be the reason. The result of model.evaluate() seems normal, as it is close to what I saw on the validation set before.
So what causes the difference between fit and evaluate? How can I solve it?
I think your core issue is that you are observing two different loss values during fit and evaluate. This has been extensively discussed here, here, here and here.
The fit() function loss includes contributions from:
Regularizers: L1/L2 regularization loss will be added during training, increasing the loss value
Batch norm variations: during training, the running mean and variance of the batch are collected, and those statistics are used to perform the normalization, irrespective of whether batch norm is set to trainable or not. See here for more discussion on that, and the sketch after this list for a way to observe the difference directly.
Multiple batches: of course, the training loss will be averaged over multiple batches. So if you take the average of the first 100 batches and evaluate on the 100th batch only, the results will be different.
evaluate, on the other hand, just does a forward pass and returns the loss value; there is nothing random here.
Bottom line: you should not compare training and validation loss (or fit and evaluate loss); those functions do different things. Look at other metrics to check whether your model is training fine.
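To see the batch-norm and regularization contributions directly, you can run the forward pass both ways yourself. A sketch with hypothetical names (model, x, y) and a binary cross-entropy loss, TF 2.x assumed:

import tensorflow as tf

loss_fn = tf.keras.losses.BinaryCrossentropy()

# What evaluate() sees: batch norm uses its accumulated moving statistics.
eval_like_loss = loss_fn(y, model(x, training=False))

# Closer to what fit() sees: batch norm normalizes with the current batch's statistics,
# and any L1/L2 regularization terms are added on top of the prediction loss.
train_like_loss = loss_fn(y, model(x, training=True))
if model.losses:
    train_like_loss += tf.add_n(model.losses)

print(float(eval_like_loss), float(train_like_loss))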

Multi GPU architecture, gradient averaging - less accurate model?

When I execute the cifar10 model as described at https://www.tensorflow.org/tutorials/deep_cnn I achieve 86% accuracy after approx 4 hours using a single GPU; when I utilize 2 GPUs, the accuracy drops to 84%, but reaching 84% accuracy is faster on 2 GPUs than on 1.
My intuition is that the average_gradients function as defined at https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py returns a less accurate gradient value, as an average of gradients will be less accurate than the actual gradient value.
If the gradients are less accurate, then the parameters that control the function learned during training are less accurate. Looking at the code (https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py), why is averaging the gradients over multiple GPUs less accurate than computing the gradient on a single GPU?
Is my intuition of averaging the gradients producing a less accurate value correct ?
Randomness in the model is described as :
The images are processed as follows:
They are cropped to 24 x 24 pixels, centrally for evaluation or randomly for training.
They are approximately whitened to make the model insensitive to dynamic range.
For training, we additionally apply a series of random distortions to artificially increase the data set size:
Randomly flip the image from left to right.
Randomly distort the image brightness.
Randomly distort the image contrast.
src : https://www.tensorflow.org/tutorials/deep_cnn
Does this have an effect on training accuracy?
Update:
Attempting to investigate this further, here are the loss values when training with different numbers of GPUs:
Training with 1 GPU: loss value 0.7, accuracy 86%
Training with 2 GPUs: loss value 0.5, accuracy 84%
Shouldn't the loss value be lower for higher accuracy, not vice versa?
In the code you linked, using the average_gradients function with 2 GPUs is exactly equivalent (1) to simply using 1 GPU with twice the batch size.
You can see it in the definition:
grad = tf.concat(axis=0, values=grads)
grad = tf.reduce_mean(grad, 0)
Using a larger batch size (given the same number of epochs) can have any kind of effect on your results.
Therefore, if you want to do exactly equivalent (1) calculations in the 1-GPU and 2-GPU cases, you may want to halve the batch size in the latter case. (People sometimes avoid doing that, because smaller batch sizes may also make the computation on each GPU slower in some cases.)
Additionally, one needs to be careful with learning rate decay here. If you use it, you want to make sure the learning rate is the same in the nth epoch in both 1-GPU and 2-GPU cases -- I'm not entirely sure this code is doing the right thing here. I tend to print the learning rate in the logs, something like
print sess.run(lr)
should work here.
(1) Ignoring issues related to pseudo-random numbers, finite precision or data set sizes not divisible by the batch size.
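A quick numerical sketch of that equivalence (a toy linear model, not the tutorial's code): averaging the two half-batch mean gradients gives the same result as the gradient of the mean loss over the full batch.

import tensorflow as tf

w = tf.Variable([1.0, -2.0])
x = tf.random.normal([8, 2])
y = tf.random.normal([8])

def mean_loss(xb, yb):
    return tf.reduce_mean(tf.square(tf.linalg.matvec(xb, w) - yb))

with tf.GradientTape(persistent=True) as tape:
    loss_gpu0 = mean_loss(x[:4], y[:4])   # "tower" 0: first half of the batch
    loss_gpu1 = mean_loss(x[4:], y[4:])   # "tower" 1: second half of the batch
    loss_full = mean_loss(x, y)           # single device, full batch

avg_grad = (tape.gradient(loss_gpu0, w) + tape.gradient(loss_gpu1, w)) / 2.0
full_grad = tape.gradient(loss_full, w)
print(avg_grad.numpy(), full_grad.numpy())  # identical up to floating-point error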
There is a decent discussion of this here (not my content). Basically when you distribute SGD, you have to communicate gradients back and forth somehow between workers. This is inherently imperfect, and so your distributed SGD typically diverges from a sequential, single-worker SGD at least to some degree. It is also typically faster, so there is a trade off.
[Zhang et al., 2015] proposes one method for distributed SGD called elastic-averaged SGD. The paper goes through a stability analysis characterizing the behavior of the gradients under different communication constraints. It gets a little heavy, but it might shed some light on why you see this behavior.
Edit: regarding whether the loss should be lower for the higher accuracy, it is going to depend on a couple of things. First, I am assuming that you are using softmax cross-entropy for your loss (as stated in the deep_cnn tutorial you linked), and assuming accuracy is the total number of correct predictions divided by the total number of samples. In this case, a lower loss on the same dataset should correlate to a higher accuracy. The emphasis is important.
If you are reporting loss during training but then report accuracy on your validation (or testing) dataset, it is possible for these two to be only loosely correlated. This is because the model is fitting (minimizing loss) to a certain subset of your total samples throughout the training process, and then tests against new samples that it has never seen before to verify that it generalizes well. The loss against this testing/validation set could be (and probably is) higher than the loss against the training set, so if the two numbers are being reported from different sets, you may not be able to draw comparisons like "loss for 1 GPU case should be lower since its accuracy is lower".
Second, if you are distributing the training then you are calculating losses across multiple workers (I believe), but only one accuracy at the end, again against a testing or validation set. Maybe the loss being reported is the best loss seen by any one worker, but overall the average losses were higher.
Basically I do not think we have enough information to decisively say why the loss and accuracy do not seem to correlate the way you expect, but there are a number of ways this could be happening, so I wouldn't dismiss it out of hand.
I've also encountered this issue.
See Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour from Facebook which addresses the same issue. The suggested solution is simply to scale up the learning rate by k (after some reasonable warm-up epochs) for k GPUs.
In practice I've found out that simply summing up the gradients from the GPUs (rather than averaging them) and using the original learning rate sometimes does the job as well.
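A sketch of the linear-scaling heuristic mentioned above (made-up numbers, not the paper's code): scale the learning rate by the number of GPUs and ramp up to it over a few warm-up epochs.

base_lr = 0.1       # learning rate tuned for the single-GPU batch size
num_gpus = 2
warmup_epochs = 5

def lr_for_epoch(epoch):
    scaled_lr = base_lr * num_gpus            # linear scaling rule: k GPUs -> k * lr
    if epoch < warmup_epochs:
        # ramp linearly from base_lr up to the scaled rate during warm-up
        return base_lr + (scaled_lr - base_lr) * (epoch + 1) / warmup_epochs
    return scaled_lr

print([round(lr_for_epoch(e), 3) for e in range(8)])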