Can TensorFlow take the gradient of the matrix 2-norm? - tensorflow

Normally the matrix norm we use in TensorFlow is the Frobenius norm, which is easy to compute and easy to understand (e.g., from a Bayesian view). But in many cases it is the largest singular value that matters. Is it possible to optimize that in TensorFlow? It depends on whether TensorFlow can take the gradient with respect to the matrix 2-norm.

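Actually, the spectral norm is equal to the largest singular value. To get this value you can use TensorFlow's tf.linalg.svd. The SVD op has a registered gradient in TensorFlow, so the largest singular value can be differentiated with respect to the matrix entries.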
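A minimal sketch of that idea, assuming TensorFlow 2.x in eager mode. The matrix `A` is an arbitrary example chosen so the result is easy to check by hand; it is not taken from the question.

```python
import tensorflow as tf

# Arbitrary example matrix; a tf.Variable is watched by the tape automatically.
A = tf.Variable([[3.0, 0.0],
                 [0.0, 1.0]])

with tf.GradientTape() as tape:
    # With compute_uv=False only the singular values are returned,
    # sorted in descending order, so s[0] is the spectral norm.
    s = tf.linalg.svd(A, compute_uv=False)
    spectral_norm = s[0]

# For a simple largest singular value, the gradient is u1 * v1^T, the outer
# product of the leading left and right singular vectors.
grad = tape.gradient(spectral_norm, A)
print(spectral_norm.numpy())
print(grad.numpy())
```

One caveat: when the largest singular value is repeated, the spectral norm is not differentiable at that point, so the returned gradient should be treated with care there.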

Related

How to debug exploding gradient (covariance matrix) in Tensorflow 2.0 (TFP)

A question that comes from the fact that I never had to debug my models in TF so deeply.
I'm running a variational inference with a full-rank Gaussian approximation using Tensorflow Probability. I noticed my optimization often explodes. Here is my loss curve.
I suspect numerical issues, as all the losses and the optimization process look reasonable and I don't observe any NaNs.
I use tfp.distributions.MultivariateNormalTriL with a covariance parameter transformed by tfp.bijectors.FillScaleTriL with the default diagonal shift. The condition number of the covariance matrix is reasonable. The variational inference is performed with fit_surrogate_posterior function.
I optimize with SGD with momentum, using 10 samples per iteration.
Internally in Tensorflow Probability source code, the minimization objective uses a gradient tape:
with tf.GradientTape(watch_accessed_variables=trainable_variables is None) as tape:
    for v in trainable_variables or []:
        tape.watch(v)
    loss = loss_fn()
In order to solve my issue, I would like to see the gradient through every operation.
My question is: how can I get more insight into which operation is exploding during the gradient computation? How can I get the value of the gradient at every tensor?
And if any of you faced a similar issue:
Is there a better way to prevent instabilities in the covariance matrix optimization?
Detailed explanations:
I observed that this explosion is caused by one parameter (though it is not always the same parameter that explodes). This can be simply checked by comparing the covariance matrix two iterations before the explosion
and one iteration before the point where the loss explodes
Note the last parameter. When I run the same optimization multiple times, it might happen that one of the "small" parameters (rows from 9 to the last) explodes at some point.
Thanks,
Mateusz
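One possible way to inspect per-tensor gradients, sketched under some assumptions: TensorFlow 2.x in eager mode, and a loss that you can evaluate inside your own tape rather than inside fit_surrogate_posterior. The variable `raw_scale` and the toy loss are hypothetical stand-ins, not the model from the question.

```python
import tensorflow as tf

# Hypothetical stand-in for an unconstrained scale parameter.
raw_scale = tf.Variable([0.5, -3.0])

# persistent=True allows several tape.gradient calls on the same tape.
with tf.GradientTape(persistent=True) as tape:
    # Keep references to the intermediate tensors so that they can be
    # used as gradient sources afterwards.
    scale = tf.nn.softplus(raw_scale)
    log_scale = tf.math.log(scale)
    loss = -tf.reduce_sum(log_scale)

# Gradient of the loss with respect to each stage of the computation.
grads = {
    "log_scale": tape.gradient(loss, log_scale),
    "scale": tape.gradient(loss, scale),
    "raw_scale": tape.gradient(loss, raw_scale),
}
del tape  # release the resources held by the persistent tape

for name, g in grads.items():
    # Raises InvalidArgumentError if the gradient contains NaN or Inf.
    tf.debugging.check_numerics(g, message="gradient w.r.t. " + name)
    print(name, "max |grad| =", float(tf.reduce_max(tf.abs(g))))
```

Comparing the magnitudes stage by stage shows where the gradient starts to grow; in this toy example the growth appears at the log of a small softplus output.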

Multiple questions regarding the KL term in the ELBO loss with TensorFlow Probability

I have been trying to conduct a few experiments using TensorFlow Probability (TFP), and I got a few questions.
What is the proper value of the coefficient of the KL loss?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
As I go from 2D input to 3D image volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
What is the guideline for getting a proper value for this coefficient? Maybe the two loss terms should be of the same order of magnitude?
The current coefficient only takes care of the number of training samples, but not the network complexity or the number of parameters in the network, and I assume the KL loss increases with the complexity of the model.
I am trying to implement a neural network with the KL loss without using keras.model.losses, because of some software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0. The issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss, not from model.losses, but from the layers or weights of the network in a TF construct?
Is batch normalization or group normalization still helpful in Bayesian deep learning?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
In eq. 8 of the BBB paper, M refers to the number of mini-batches. To be consistent with non-stochastic gradient learning, the KL term should be scaled by the number of mini-batches, which is what Graves does. An alternative is given in eq. 9, where the KL term is scaled by \pi_i, with the values in the set {\pi_i} summing to one.
In the TFP example, it does look like num_examples is the total number of independent samples within the training set, which is much larger than the number of batches. This goes by a few names, such as Safe Bayes or Tempering. Have a look at sec. 8 of this paper for some more discussion about the use of tempering within Bayesian inference and its suitability.
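As a side note on the arithmetic, here is one way to compare the two coefficients. All numbers are hypothetical, and whether a given example averages its likelihood term over the batch is an assumption of this sketch, not something established above.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
num_examples = 60000                        # N: size of the training set
batch_size = 100                            # B: examples per mini-batch
num_batches = num_examples // batch_size    # M = N / B

kl = 1200.0               # KL(q || prior), summed over all weights
nll_per_example = 0.8     # average negative log-likelihood per example

# Eq. 8 style: KL / M, with the likelihood term SUMMED over the mini-batch.
loss_sum = kl / num_batches + batch_size * nll_per_example

# Per-example style: KL / N, with the likelihood term AVERAGED over the batch.
loss_mean = kl / num_examples + nll_per_example

# Under these assumptions the two losses differ only by the constant factor B.
print(loss_sum, loss_mean * batch_size)
```

So if the cross-entropy is a per-example mean, a coefficient of 1/N weights the KL term in the same proportion as 1/M does with a summed cross-entropy.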
As I go from 2D input to 3D image volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
The ELBO loss will always be larger than your cross-entropy alone (which defines your likelihood). Have a look at how the KL divergence term in the ELBO is computed (assuming a full mean-field approach, where each weight/parameter is assumed to be independent).
Since the assumed posterior is factorised (each parameter is assumed independent), you can write the joint distribution as a product. This means that when you take the log while computing the KL between the approximate posterior and the prior, you can write it as a sum of KL terms, one per parameter. Since the KL is >= 0, each parameter you add to your model adds another non-negative term to your ELBO loss. This is likely why the loss is so much larger for your 3D model: it has more parameters.
Another reason this could occur is if you have less data (your M is smaller, so the coefficient 1/M is larger and the KL term is weighted more).
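The growth of the KL term with the number of parameters can be sketched in a few lines. This assumes a factorised Gaussian posterior and a standard normal prior; the posterior values `mu_q` and `sigma_q` are hypothetical and shared by every parameter purely for illustration.

```python
import math

def kl_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for scalar Gaussians."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

def total_kl(num_params, mu_q=0.3, sigma_q=0.5):
    # Mean-field: the joint KL is the sum of the per-parameter KLs.
    # Every parameter gets the same (hypothetical) posterior here.
    return sum(kl_gaussian(mu_q, sigma_q, 0.0, 1.0) for _ in range(num_params))

# The total KL scales linearly with the parameter count in this setup.
for n in (1000, 100000):
    print(n, total_kl(n))
```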
What is the guideline for getting a proper value for this coefficient? Maybe the two loss terms should be of the same order of magnitude?
I am unsure of any specific guideline. For training, you are interested primarily in the gradients, and a large loss does not mean a large gradient. Have a look at the gradients contributed by the negative log-likelihood and by the KL term in your ELBO. If the KL term is too large, you probably need a more informative prior or more data (you could simply scale the KL term, but this feels a bit yucky for the Bayesian in me).
The current coefficient only takes care of the number of training samples, but not the network complexity or the number of parameters in the network, and I assume the KL loss increases with the complexity of the model.
Yes, as stated before, in general more parameters == greater ELBO loss (for a mean-field approach as used in Bayes by Backprop).
I am trying to implement a neural network with the KL loss without using keras.model.losses, because of some software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0. The issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss, not from model.losses, but from the layers or weights of the network in a TF construct?
I am unsure about the best way to tackle this part. I would be cautious about going back to older versions where this isn't explicitly supported. They put those warnings/exceptions in for a reason.
Is batch normalization or group normalization still helpful in Bayesian deep learning?
For variational inference (as done in Bayes by Backprop) Batchnorm is fine. For sampling methods such as MCMC, Batch normalization is no longer suitable. Have a look at https://arxiv.org/pdf/1908.03491v1.pdf for info on suitability for batch norm with sampling methods for approx. Bayesian inference.

sampled_softmax_loss vs negative sampling

I am working on a text autoencoder and want to use negative sampling to train the model. I want to know the difference between negative sampling and sampled softmax.
Thanks in advance
https://www.tensorflow.org/extras/candidate_sampling.pdf
According to TensorFlow, negative sampling relates to logistic loss while sampled softmax relates to softmax.
Both of them, at the core, pick a sample of negative examples to compute the loss on and update gradients.
For your model, use one of them if your output is very large (many classes) AND the regular loss is too slow to compute. If the output has few classes, there's not much gain. If the training is fast anyway, why bother with approximations?
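The logistic-versus-softmax distinction can be sketched in plain Python for a single training example. This is a simplified illustration and not TensorFlow's implementation: the function names are hypothetical, and the correction for the sampling distribution that sampled softmax normally applies to the logits is left out.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def negative_sampling_loss(true_score, negative_scores):
    """Logistic loss: one binary decision per class (true -> 1, negatives -> 0)."""
    loss = -log_sigmoid(true_score)
    loss -= sum(log_sigmoid(-s) for s in negative_scores)
    return loss

def sampled_softmax_loss(true_score, negative_scores):
    """Softmax cross-entropy over the true class plus the sampled negatives only."""
    scores = [true_score] + list(negative_scores)
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - true_score

# Hypothetical scores for the true class and three sampled negatives.
true_score, negatives = 2.0, [0.5, -1.0, 0.1]
print(negative_sampling_loss(true_score, negatives))
print(sampled_softmax_loss(true_score, negatives))
```

In both cases the cost depends on the number of sampled negatives rather than on the full number of classes, which is where the speed-up comes from.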

Any way to backprop derivatives when the derivatives of the custom loss function are calculated by myself

I have been using TensorFlow to train deep NN acoustic models for speech recognition for a while. The loss function I use is cross-entropy and the NN models perform very well. Now I want to change the loss function to a more complex one named MMI (Maximum Mutual Information), which is also a classical criterion used in the speech recognition domain. I put one paper here which describes this loss function, in case you are interested.
When using this special loss function, the derivatives of the loss function w.r.t. the activations of the output layer can be computed by some special algorithms defined in the Hidden Markov Model scenario. This means that I can compute the derivatives of the loss function w.r.t. the activations of the output layer myself, rather than just writing out the loss function and leaving TensorFlow to calculate the derivatives automatically.
But with my limited experience, I don't know how to backprop the derivatives which I calculate myself. Is there any way to do this without touching the TensorFlow C++ source code?
Probably yes, if all the computations involved use existing TensorFlow functions.
You just have to set up the chain of operations that computes the gradients from the current variables.
Then you just apply tf.assign_add() to the variables, adding your gradients multiplied by minus the learning rate.
You are thus mimicking what usually happens in the background in TF.
EDIT: If the gradients are calculated in NumPy, for instance, you can use:
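# Compute the gradient outside TensorFlow, e.g. with NumPy
a = f(output_npy, variables_npy)
# Placeholder that receives the externally computed gradient
grad_from_user = tf.placeholder(tf.float32, a.shape)
# Gradient-descent step: variables <- variables - lr * gradient
grad_update = tf.assign_add(variables_tf, -lr * grad_from_user)
# Then run the update, feeding in the NumPy gradient
sess.run(grad_update, feed_dict={grad_from_user: a, ...})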
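A different route, not the one described in the answer above, is tf.custom_gradient: you supply the derivative w.r.t. the output-layer activations and let TensorFlow backpropagate it through the rest of the network. The sketch below assumes TensorFlow 2.x in eager mode; the squared-sum loss and the name `loss_with_manual_grad` are placeholders standing in for the real criterion and its hand-computed derivative.

```python
import tensorflow as tf

@tf.custom_gradient
def loss_with_manual_grad(activations):
    # Placeholder forward computation standing in for the real criterion.
    loss = tf.reduce_sum(tf.square(activations))

    def grad(upstream):
        # Hand-written d(loss)/d(activations); here it is 2 * activations.
        # In practice this is where the externally derived formula goes.
        return upstream * 2.0 * activations

    return loss, grad

# Tiny one-layer "network" so the result can be checked by hand.
w = tf.Variable([[1.0], [2.0]])
x = tf.constant([[3.0, 4.0]])

with tf.GradientTape() as tape:
    activations = tf.matmul(x, w)    # shape (1, 1), value 11
    loss = loss_with_manual_grad(activations)

# TensorFlow chains the hand-written gradient back to the weights.
dw = tape.gradient(loss, w)
print(dw.numpy())
```

If the derivative has to be computed by non-TensorFlow code, it would additionally need to be wrapped (for example with tf.py_function); that part is not shown here.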

what is the difference between sampled_softmax_loss and nce_loss in tensorflow?

I notice there are two functions for negative sampling in TensorFlow that compute the loss (sampled_softmax_loss and nce_loss). The parameters of these two functions are similar, but I really want to know what the difference between the two is.
Sampled softmax is all about selecting a sample of the given size and computing the softmax loss over it. The main objective is to make the result of the sampled softmax approximate the true softmax, so the algorithm concentrates a lot on selecting those samples from the given distribution.
NCE loss, on the other hand, is more about selecting noise samples and trying to mimic the true softmax. It takes only one true class and K noise classes.
Sampled softmax tries to normalise over all samples in your output. With a non-normal distribution (logarithmic over your labels), this is not an optimal loss function. Note that although they have the same parameters, the way you use the functions is different. Take a look at the documentation here: https://github.com/calebchoo/Tensorflow/blob/master/tensorflow/g3doc/api_docs/python/functions_and_classes/shard4/tf.nn.nce_loss.md and read this line:
By default this uses a log-uniform (Zipfian) distribution for sampling, so your labels must be sorted in order of decreasing frequency to achieve good results. For more details, see log_uniform_candidate_sampler.
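To see why the ordering of labels matters, the sampling probabilities can be computed directly from the formula given in the documentation of the log-uniform candidate sampler, P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1). The vocabulary size below is a hypothetical value.

```python
import math

def log_uniform_prob(class_id, range_max):
    """Probability that the log-uniform (Zipfian) sampler draws class_id."""
    return ((math.log(class_id + 2) - math.log(class_id + 1))
            / math.log(range_max + 1))

range_max = 10000  # hypothetical vocabulary size
probs = [log_uniform_prob(c, range_max) for c in range(range_max)]

# Low ids are drawn far more often than high ids, so the sampler matches the
# data only if id 0 is the most frequent label, id 1 the next, and so on.
for c in (0, 1, 10, 100, 1000):
    print(c, probs[c])
```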
Take a look at this paper where they explain why they use it for word embeddings: http://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf
Hope this helps!
Check out this documentation from TensorFlow https://www.tensorflow.org/extras/candidate_sampling.pdf
They seem pretty similar, but sampled softmax is only applicable for a single label while NCE extends to the case where your labels are a multiset. NCE can then model the expected counts rather than presence/absence of a label. I'm not clear on an exact example of when to use the sampled_softmax.
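For reference, a minimal sketch of calling both functions with the same arguments, assuming TensorFlow 2.x in eager mode. All sizes, labels, and values are hypothetical, and the weights are random rather than trained.

```python
import tensorflow as tf

num_classes, dim, batch = 1000, 16, 4    # hypothetical sizes
tf.random.set_seed(0)

# Output-layer parameters: one weight row and one bias per class.
weights = tf.Variable(tf.random.normal([num_classes, dim]))
biases = tf.Variable(tf.zeros([num_classes]))

# Hidden activations feeding the output layer, and one true class per example.
inputs = tf.random.normal([batch, dim])
labels = tf.constant([[1], [5], [42], [7]], dtype=tf.int64)

common = dict(weights=weights, biases=biases, labels=labels, inputs=inputs,
              num_sampled=8, num_classes=num_classes, num_true=1)

# Both calls return one loss value per example in the batch.
nce = tf.nn.nce_loss(**common)
sampled = tf.nn.sampled_softmax_loss(**common)
print(nce.shape, sampled.shape)
```

The call signatures match, so switching between the two is a one-line change; the difference lies in the loss each one computes over the sampled classes.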