Does the automatic differentiation procedure in TensorFlow compute a subgradient whenever needed? If there are many subgradients, which one is chosen as the output?
I am trying to implement the paper in the link, which uses recursive neural networks to perform efficient language parsing. The objective function uses a hinge loss to pick the optimal output vectors, which makes the function non-differentiable. I used TensorFlow (v1.12) in eager mode to program the model and relied on automatic differentiation to compute the gradients. After every batch, I can see the gradient values changing and the accuracy improving slightly. After a while, the accuracy decreases again, and this cycle repeats. The model does not converge for any of the hyper-parameter configurations I tried.
Mini-batch size: 256, 512, 1024
Regularization parameter: 0.1, 0.01, 0.001
Learning rate: 0.1, 0.01, 0.001
Optimizer: gradient descent, Adagrad, Adam
In the paper, they describe how to find a subgradient of the objective function in a very abstract manner, which I have not understood yet. At the beginning I assumed that automatic gradient computation calculates the subgradient. I am now starting to doubt that, because it seems to be the only piece missing.

Unfortunately, TensorFlow does not compute subgradients, only gradients.
This is explained in "How does tensorflow handle non differentiable nodes during gradient calculation?".
To summarize: when computing a partial derivative at a point where the function is not differentiable, TensorFlow simply sets this derivative to zero.
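To make the zero-at-the-kink behaviour described above concrete, here is a minimal sketch in plain Python. It is not TensorFlow's implementation; the function names are hypothetical, and the choice of 0 at the kink is the convention this answer describes, taken here as an assumption. For a single hinge term, that 0 happens to be one valid element of the subdifferential.

```python
def hinge(margin):
    """Hinge term max(0, 1 - margin)."""
    return max(0.0, 1.0 - margin)


def hinge_grad(margin):
    """Derivative of hinge(margin), returning 0 at the kink (assumed convention).

    For margin < 1 the slope is -1 and for margin > 1 it is 0. At margin == 1
    every value in [-1, 0] is a valid subgradient; this sketch returns 0.
    """
    return -1.0 if margin < 1.0 else 0.0


for m in (0.5, 1.0, 1.5):
    print(m, hinge(m), hinge_grad(m))
```

A sample sitting exactly on the margin therefore contributes nothing to the update under this convention, which is harmless for isolated points but worth keeping in mind when many terms are clipped at once.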
As for the trouble training your model, there are no general rules for tuning the hyperparameters. I would suggest doing a grid search over learning rates (for a few epochs each) to find a good initial learning rate that gives good results for one of the optimization algorithms. Usually, Adam or SGD with momentum gives satisfactory results once a suitable initial learning rate is chosen.


What loss function to use in Keras when metric is SparseTopKCategoricalAccuracy/TopKCategoricalAccuracy?

For multiclass classification problems, Keras and tf.keras have metrics like SparseTopKCategoricalAccuracy and TopKCategoricalAccuracy. However, if one uses loss functions like SparseCategoricalCrossentropy or CategoricalCrossentropy, they cannot achieve the max values for these two metrics.
What is a good loss function to use when one wants to maximize SparseTopKCategoricalAccuracy or TopKCategoricalAccuracy?
I understand that SparseTopKCategoricalAccuracy is not differentiable, just like Accuracy. I am trying to find a smooth loss function that approximates it and yields a higher value for SparseTopKCategoricalAccuracy.
Cross-entropy is not the best loss function when you care about top-k accuracy, because it can be prone to overfitting on small datasets or with noisy labels.
As you have already pointed out, "smooth loss" functions were developed for top-k classification with SVMs. To my knowledge, there is no off-the-shelf loss function in Keras/TF that is best suited for top-k. However, I suggest trying the Smooth Surrogate Loss (SSL) presented in the article and implemented in PyTorch for use with deep neural networks (see GitHub). It derives from multi-class SVMs: SSL creates a margin between the correct top-k predictions and the incorrect ones. The training time with SSL is comparable to that with cross-entropy, thanks to a divide-and-conquer approach and the use of polynomials (see implementation).
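To illustrate why the metric itself cannot be used as a loss, here is a small plain-Python sketch of top-k accuracy. The helper name and the toy scores are hypothetical, and this is not the Keras implementation. It shows that nudging a score without changing the ranking leaves the metric unchanged, so the metric gives no useful gradient.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores in this row.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return hits / len(labels)


scores = [[0.1, 0.6, 0.3],     # true class 2 is ranked second
          [0.5, 0.2, 0.3],     # true class 0 is ranked first
          [0.4, 0.35, 0.25]]   # true class 2 is ranked third
labels = [2, 0, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)

# Raise the true-class score of the last sample without changing its rank.
nudged = [row[:] for row in scores]
nudged[2][2] += 0.05
top2_nudged = top_k_accuracy(nudged, labels, k=2)
print(top1, top2, top2_nudged)
```

A smooth surrogate replaces this step function with something that rewards the score increase even before the ranking flips.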

Multiple questions regarding the KL term in the ELBO loss with TensorFlow Probability

I have been trying to conduct a few experiments using TensorFlow Probability (TFP), and I have a few questions.
What is the proper value of the coefficient of the KL loss?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
As I go from 2D input to 3D image volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
What is the guideline for getting a proper value for this coefficient? Maybe like the two-loss terms should be the same order of magnitude?
The current coefficient only accounts for the number of training samples, not the network complexity or the number of parameters in the network, and I assume the KL loss increases with the complexity of the model.
I am trying to implement a neural network with the KL loss without using keras.model.losses, because of some software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0. The issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss, not from model.losses, but from the layers or weights of the network in a TF construct?
Is batch normalization or group normalization still helpful in Bayesian deep learning?
In the BBB paper, eq. 8, M refers to the number of mini-batches. To be consistent with non-stochastic gradient learning, the KL term should be scaled by the number of mini-batches, which is what Graves does. An alternative is given in eq. 9, where it is scaled by \pi_i, with the values in the set {\pi} summing to one.
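The two weighting schemes can be compared with a short plain-Python sketch. The function names are hypothetical, and the geometric schedule \pi_i = 2^(M-i) / (2^M - 1) is an assumption about what eq. 9 proposes that should be checked against the paper. The only property relied on here is the one stated above: the weights sum to one over an epoch.

```python
def uniform_kl_weights(num_batches):
    """Weight 1/M for every mini-batch."""
    return [1.0 / num_batches] * num_batches


def geometric_kl_weights(num_batches):
    """pi_i = 2^(M - i) / (2^M - 1) for i = 1..M (assumed schedule)."""
    m = num_batches
    return [2 ** (m - i) / (2 ** m - 1) for i in range(1, m + 1)]


M = 5  # hypothetical number of mini-batches per epoch
uniform = uniform_kl_weights(M)
geometric = geometric_kl_weights(M)
print(uniform)
print(geometric)
```

Under either scheme the full KL term is counted once per epoch. The geometric variant just front-loads it, so early mini-batches are dominated by the prior and later ones by the data.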
In the TFP example, num_examples does look like the total number of independent samples in the training set, which is much larger than the number of batches. This goes by a few names, such as Safe Bayes or tempering. Have a look at sec. 8 of this paper for more discussion of the use of tempering within Bayesian inference and its suitability.
The ELBO will always be larger than just your cross-entropy (which defines your likelihood). Have a look at how the KL divergence term in the ELBO is found, assuming a full mean-field approach where each weight/parameter is independent.
Since the assumed posterior is factorised (each parameter is assumed independent), you can write the joint distribution as a product. This means that when you take the log while computing the KL between the approximate posterior and the prior, you can write it as a sum of the KL terms for each parameter. Since the KL is >= 0, each parameter you add to your model adds another non-negative term to your ELBO. This is likely why the loss is so much larger for your 3D model: there are more parameters.
Another reason this could occur is if you have less data (your M is smaller, so dividing by M shrinks the KL term less).
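The additivity argument can be sketched in plain Python. This assumes a Gaussian mean-field posterior and a standard normal prior, for which the per-weight KL has a closed form. The posterior values, parameter counts, and num_examples below are hypothetical and only meant to show the scaling, not to reproduce any TFP layer.

```python
import math


def kl_gaussian_vs_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a single weight, in nats."""
    return 0.5 * (sigma ** 2 + mu ** 2 - 1.0) - math.log(sigma)


def total_kl(mus, sigmas):
    """Mean-field KL: the sum of the independent per-weight KL terms."""
    return sum(kl_gaussian_vs_standard_normal(m, s) for m, s in zip(mus, sigmas))


# Hypothetical posterior parameters, identical for every weight.
mu, sigma = 0.3, 0.5
small_model = total_kl([mu] * 1_000, [sigma] * 1_000)
large_model = total_kl([mu] * 100_000, [sigma] * 100_000)

num_examples = 60_000  # hypothetical training-set size used as the divisor
print(small_model / num_examples, large_model / num_examples)
```

With the divisor held fixed, a model with 100 times as many weights carries roughly 100 times the KL term, which is one plausible reason a 3D network shows a much larger KL loss than its 2D counterpart.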
I am unsure of any specific guideline. For training, you are interested primarily in the gradients, and a large loss does not mean a large gradient. Have a look at the gradients contributed by the negative log likelihood and the KL term in your ELBO. If the KL term is too large, you probably need a more informative prior or more data (you could simply scale the KL term, but this feels a bit yucky for the Bayesian in me).
Yes, as stated before, in general, more parameters == greater ELBO (for a mean-field approach as used in Bayes by Backprop).
I am unsure about the best way to tackle this part of it. I would be cautious about going to older versions where it isn't explicitly supported. They put those warnings/exceptions in for a reason.
For variational inference (as done in Bayes by Backprop) Batchnorm is fine. For sampling methods such as MCMC, Batch normalization is no longer suitable. Have a look at for info on suitability for batch norm with sampling methods for approx. Bayesian inference.

What is the purpose of the Tensorflow Gradient Tape?

I watched the Tensorflow Developer's summit video on Eager Execution in Tensorflow, and the presenter gave an introduction to "Gradient Tape." Now I understand that Gradient Tape tracks the automatic differentiation that occurs in a TF model.
I was trying to understand why I would use Gradient Tape. Can anyone explain how Gradient Tape is used as a diagnostic tool? Why would someone use Gradient Tape versus just TensorBoard visualization of weights?
So I get that the automatic differentiation that occurs with a model is to compute the gradients of each node, meaning the adjustment of the weights and biases at each node, given some batch of data. So that is the learning process. But I was under the impression that I can actually use a tf.keras.callbacks.TensorBoard() call to see the TensorBoard visualization of training, so I can watch the weights on each node and determine if there are any dead or oversaturated nodes.
Is the use of Gradient Tape only to see if some gradients go to zero or get really big, etc? Or is there some other use of the Gradient Tape?
With eager execution enabled, Tensorflow will calculate the values of tensors as they occur in your code. This means that it won't precompute a static graph for which inputs are fed in through placeholders. This means to back propagate errors, you have to keep track of the gradients of your computation and then apply these gradients to an optimiser.
This is very different from running without eager execution, where you would build a graph, simply evaluate your loss, and then pass it to an optimiser directly.
Fundamentally, because tensors are evaluated immediately, you don't have a graph to calculate gradients and so you need a gradient tape. It is not so much that it is just used for visualisation, but more that you cannot implement a gradient descent in eager mode without it.
Obviously, Tensorflow could just keep track of every gradient for every computation on every tf.Variable. However, that could be a huge performance bottleneck. They expose a gradient tape so that you can control what areas of your code need the gradient information. Note that in non-eager mode, this will be statically determined based on the computational branches that are descendants of your loss but in eager mode there is no static graph and so no way of knowing.
Having worked on this for a while after posting the initial question, I have a better sense of where Gradient Tape is useful. It seems the most useful application of Gradient Tape is when you design a custom layer in your Keras model, for example, or equivalently when you design a custom training loop for your model.
If you have a custom layer, you can define exactly how the operations occur within that layer, including the gradients that are computed and also calculating the amount of loss that is accumulated.
So Gradient tape will just give you direct access to the individual gradients that are in the layer.
Here is an example from Aurelien Geron's 2nd edition book on Tensorflow.
Say you have a function you want as your activation.
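import tensorflow as tf

def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2
Now if you want to take derivatives of this function with respect to w1 and w2:
w1, w2 = tf.Variable(5.), tf.Variable(3.)
with tf.GradientTape() as tape:
    # operations on the variables are recorded while the tape is active
    z = f(w1, w2)
# dz/dw1 = 6 * w1 + 2 * w2 = 36 and dz/dw2 = 2 * w1 = 10
gradients = tape.gradient(z, [w1, w2])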
So the tape calculates the gradients and gives you access to their values. Then you can double them, square them, triple them, etc., whatever you like. Whatever you choose to do, you can then pass those adjusted gradients to the optimizer (for example via apply_gradients) for the weight-update step.
I think the most important thing to say in answer to this question is simply that GradientTape is not a diagnostic tool. That's the misconception here.
GradientTape is a mathematical tool for automatic differentiation (autodiff), which is the core functionality of TensorFlow. It does not "track" the autodiff, it is a key part of performing the autodiff.
As the other answers describe, it is used to record ("tape") a sequence of operations performed on some input and producing some output, so that the output can be differentiated with respect to the input via backpropagation (reverse-mode autodiff), typically in order to then perform gradient descent optimisation.
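To show what "recording a sequence of operations" means, here is a toy tape in plain Python. It is a teaching sketch under simplifying assumptions (scalars only, just + and *), and the Tape and Var classes are hypothetical, not how TensorFlow implements tf.GradientTape. It differentiates the same f(w1, w2) = 3 * w1 ** 2 + 2 * w1 * w2 used earlier in this thread.

```python
class Tape:
    """Records operations so they can be replayed backwards."""

    def __init__(self):
        self.ops = []  # each entry: (output, [(input, local_derivative), ...])

    def gradient(self, target, sources):
        grads = {id(target): 1.0}
        # Walk the recorded operations in reverse order (reverse-mode autodiff).
        for out, parents in reversed(self.ops):
            g_out = grads.get(id(out), 0.0)
            for parent, local in parents:
                grads[id(parent)] = grads.get(id(parent), 0.0) + g_out * local
        return [grads.get(id(s), 0.0) for s in sources]


class Var:
    """A scalar whose arithmetic is recorded on a tape."""

    def __init__(self, value, tape):
        self.value, self.tape = value, tape

    def _record(self, value, parents):
        out = Var(value, self.tape)
        self.tape.ops.append((out, parents))
        return out

    def __add__(self, other):
        return self._record(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return self._record(self.value * other.value,
                            [(self, other.value), (other, self.value)])


tape = Tape()
w1, w2 = Var(5.0, tape), Var(3.0, tape)
three, two = Var(3.0, tape), Var(2.0, tape)
z = three * w1 * w1 + two * w1 * w2  # f(w1, w2) = 3 * w1**2 + 2 * w1 * w2
grads = tape.gradient(z, [w1, w2])
print(z.value, grads)
```

Without the recorded list of operations there would be nothing to walk backwards, which is the sense in which the tape is part of performing the differentiation and not a viewer for it.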

Any way to backprop derivatives when the derivatives of the custom loss function are calculated by myself?

I have been using TensorFlow to train deep NN acoustic models for speech recognition for a while. The loss function I use is cross-entropy and the NN models perform very well. Now I want to change the loss function to a more complex one named MMI (Maximum Mutual Information), which is also a classical criterion used in the speech recognition domain. I put one paper here which describes this loss function, in case you are interested.
When using this special loss function, the derivatives of the loss function w.r.t. the activations of output layer can be computed by some special algorithms defined in Hidden Markov Model scenario. It means that I can compute the derivatives of the loss function w.r.t. the activations of output layer by myself rather than just write out the loss function and leave Tensorflow to calculate the derivatives automatically.
But with my limited experience, I don't know how to backprop the derivatives that I calculate myself. Is there any way to do this without touching the TensorFlow C++ source code?
Probably yes, if all the computation involved uses existing TensorFlow functions.
You just have to set up the chain of operations that compute the gradients from the current variables.
Then you just use tf.assign_add() to the variables with your gradients multiplied by minus the learning rate.
You are thus mimicking what happens in the background in TF usually.
EDIT: If the gradients are calculated in NumPy, for instance, you can feed them in through a placeholder:
# perform the NumPy calculations, giving an array `a` of gradients
grad_from_user = tf.placeholder(tf.float32, a.shape)
# and then, in the run call: ..., feed_dict={grad_from_user: a, ...})
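The idea of chaining a user-supplied output-layer gradient back to the weights and applying it by hand can be sketched in plain Python. Everything here is illustrative: the single linear layer, the weights, and the helper names are hypothetical, and cross-entropy stands in for MMI only because its gradient with respect to the logits (softmax minus one-hot) is easy to verify.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(W, x):
    """Output-layer activations (logits) of a single linear layer."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def user_grad_wrt_logits(logits, target):
    """Stand-in for the externally computed dL/d(logits)."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


def loss(W, x, target):
    return -math.log(softmax(forward(W, x))[target])


W = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.2]]  # hypothetical 3x2 weight matrix
x, target, lr = [1.0, 2.0], 0, 0.1

loss_before = loss(W, x, target)
for _ in range(20):
    d_logits = user_grad_wrt_logits(forward(W, x), target)
    for i, row in enumerate(W):
        for j in range(len(row)):
            # Chain rule: dL/dW[i][j] = dL/dlogits[i] * x[j].
            # The update plays the role of tf.assign_add(var, -lr * grad).
            row[j] += -lr * d_logits[i] * x[j]
loss_after = loss(W, x, target)
print(loss_before, loss_after)
```

In a deeper network, the same externally computed gradient would be the starting point for the chain of operations mentioned above, carried back through each earlier layer.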

Tensorflow optimizers: loss sum vs mean

I'm wondering if the Tensorflow optimizers (in particular the AdamOptimizer) have a preference when it comes to defining a loss function as a sum or as a mean/average over a minibatch?
In general, my assumption was that using the mean is preferred, because the loss then does not depend on the size of the mini-batches. Thus, it is easier to find a learning rate that works with any batch size.
However, Tensorflow defines e.g. l2_loss internally as:
output = sum(t ** 2) / 2
Does this imply that the optimizers account for the batch size internally already, i.e., they expect losses to scale linearly with the batch size? Also, what's the motivation of taking half the L2 norm from the perspective of optimization?
Well, here l2_loss is actually a regularization loss function. We add it to our main loss function in order to prevent the parameters from overfitting. We normally divide the L2 loss by 2 to make taking the gradient easier: the factor of 2 from differentiating the square cancels.
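The optimizers themselves do not rescale the loss by the batch size; they minimise whatever scalar they are given, so the average over the mini-batch is normally taken when the data loss is defined (for example with tf.reduce_mean).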
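The effect of summing versus averaging can be checked with a small plain-Python sketch. It assumes a one-parameter least-squares model; the data and function names are hypothetical. It also shows why the 1/2 in sum(t ** 2) / 2 is convenient: the gradient is simply t.

```python
def grad_sum_loss(w, xs, ys):
    """d/dw of sum_i (w * x_i - y_i)^2."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


def grad_mean_loss(w, xs, ys):
    """d/dw of the same loss averaged over the mini-batch."""
    return grad_sum_loss(w, xs, ys) / len(xs)


def l2_loss_grad(t):
    """Gradient of sum(t ** 2) / 2 with respect to t: the factor 2 cancels."""
    return list(t)


w = 0.5
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]

# Repeating the batch doubles the summed gradient but not the averaged one.
g_sum_small = grad_sum_loss(w, xs, ys)
g_sum_big = grad_sum_loss(w, xs * 2, ys * 2)
g_mean_small = grad_mean_loss(w, xs, ys)
g_mean_big = grad_mean_loss(w, xs * 2, ys * 2)
print(g_sum_small, g_sum_big, g_mean_small, g_mean_big)
```

With a summed loss, the effective step size of plain gradient descent therefore grows with the batch size unless the learning rate is reduced to compensate.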