Implement Kullback-Leibler as loss function for arbitrary distributions - tensorflow

I had already implemented an optimization (gradient descent) algorithm using the KL-Divergence built into TensorFlow Probability as a loss function. In principle it worked well, but then I found out that the list of registered distribution pairs for which this KL-Divergence is defined is quite limited. I tried to minimize the KL-Divergence between a Gaussian Mixture Model (as the true distribution) and a Normal Distribution (optimizing mean and std such that the KL-Divergence becomes minimal), which was not possible.
So I tried to implement my own approach, which did not work:
import numpy as np
from scipy.stats import norm
import tensorflow as tf
The idea I had was to create densities of the needed distributions via scipy.stats (let's say Normal distributions) and convert the resulting density values to tensors:
x = np.arange(-10,10,0.001)
mu_train = tf.Variable(2.0)
p_pdf = norm.pdf(x, 0, 1)
q_pdf = norm.pdf(x, mu_train,1)
p = tf.convert_to_tensor(p_pdf)
q = tf.convert_to_tensor(q_pdf, dtype=tf.float64)
Now I defined the KL-Divergence as a function that only depends on q.
def kl_loss(q):
    return tf.reduce_sum(
        tf.where(p == 0, tf.zeros(p.shape, tf.float64), p * tf.math.log(p / q))
    )
Then I calculated the gradient of kl_loss with respect to mu_train, but the output I get from this is None.
with tf.GradientTape() as tape:
    tape.watch(mu_train)
    loss = kl_loss(q)

d_loss_d_mu = tape.gradient(loss, mu_train)
print(d_loss_d_mu)
Now that I have thought about it, getting None as output makes sense to me: kl_loss(q) depends only on the values q(x) generated by the density q, not on mu_train directly, since mu_train is just a parameter of the Normal Distribution, while the input to kl_loss is an array/tensor of already evaluated density values.
Does anyone know how I can work around this, or does anyone have a completely different solution for using the KL-Divergence as a loss function with arbitrary distributions, such that I can compute gradients with respect to the parameters and run a gradient descent minimizer?
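A minimal sketch of one possible workaround, assuming tensorflow_probability is available: rebuild q's density from TensorFlow ops inside the taped computation, so that the graph actually depends on mu_train (the grid x and the fixed target density p can stay constant).

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# fixed evaluation grid and fixed target density p (standard normal)
x = tf.constant(np.arange(-10, 10, 0.001), dtype=tf.float64)
p = tfd.Normal(loc=tf.constant(0.0, tf.float64),
               scale=tf.constant(1.0, tf.float64)).prob(x)

mu_train = tf.Variable(2.0, dtype=tf.float64)

def kl_loss(mu):
    # q(x) is rebuilt from mu on every call, so the gradient can flow to mu
    q = tfd.Normal(loc=mu, scale=tf.constant(1.0, tf.float64)).prob(x)
    dx = tf.constant(0.001, tf.float64)  # grid spacing: the sum approximates the integral
    return dx * tf.reduce_sum(
        tf.where(p == 0, tf.zeros_like(p), p * tf.math.log(p / q))
    )

with tf.GradientTape() as tape:
    loss = kl_loss(mu_train)

print(tape.gradient(loss, mu_train))  # a finite gradient now, not None

The same pattern should extend to a mixture target (e.g. tfd.MixtureSameFamily) in place of the standard normal p.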

Related

sampling within GradientTape

I am computing the different terms of the ELBO and its expectations to illustrate and get a better grasp of the reparametrization trick, as nicely explained here under undifferentiable expectations.
As a simplified example in this journey, I have a random variable $z$ distributed as follows
$$z \sim \mathcal{N}(\mu=0, \sigma=t)$$
where $t$ is a parameter that I want to optimize (like MLE). I take one sample $z_i$ from the distribution and I want to compute the gradient of the probability of $z_i$ with respect to $t$, at some given value of $t=t_{\text{value}}$
$$\nabla_t p_z(z_i)$$
I use a gradient tape for this and a tfd.Normal object to compute the pdf. I must build the distribution object within the gradient tape so that I can optimize with respect to $t$, upon which the distribution depends. Therefore I must also sample within the gradient tape.
Reparametrization trick aside, when I sample from the same distribution object I get a different gradient compared to when I sample from an equivalent scipy.stats object (which is therefore not connected to the computational graph).
In fact, the gradients computed in the second case (using scipy.stats) match the ones computed with sympy differentiation, and there is clearly a relation between the two sets of gradients.
Note that I am just computing the gradients over a single data item each time, not computing expectations.
Clearly, there is some extra dependency introduced in the computational graph by sampling and this affects the gradients. Is this something expected, or not, or am I just doing something too weird?
import sympy as sy
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import tensorflow_probability as tfp
import tensorflow as tf
tfd = tfp.distributions
def sample_gradients(sample_from_tf=True, tvalue=1.5):
    tvar = tf.Variable(tvalue)
    grads, samples = [], []
    # do 100 times to paint the gradients later
    for _ in range(100):
        with tf.GradientTape() as tape:
            dz = tfd.Normal(loc=0, scale=tvar)
            if sample_from_tf:
                sample = dz.sample(1)
            else:
                sample = stats.norm(loc=0, scale=tvar.numpy()).rvs(1)
            loss = dz.prob(sample)
        grad = tape.gradient(loss, tvar)
        grads.append(grad.numpy())
        samples.append(sample[0])
    return grads, samples
# prepare for computing grads of pdf with sympy
t,z = sy.symbols(r't z')
ptz = 1/(t*sy.sqrt(2*sy.pi))*sy.exp(-(z/t)**2/2) # gaussian pdf in sympy
# a chosen value for t
tvalue = 1.5
# compute gradients sampling from the same tfd distribution
# and plot them compared to sympy
grads, samples = sample_gradients(sample_from_tf=True)
sgrads = [ptz.diff(t).subs({t: tvalue, z: zi}).n() for zi in samples]
plt.figure()
plt.scatter(sgrads, grads)
plt.title(r"sampling from the same tfd object $\rightarrow$ DIFFERENT");
plt.xlabel("symbolic gradients"); plt.ylabel("gradients from TF")
# compute gradients sampling from the equivalent distribution in scipy.stats
# and plot them compared to sympy
grads, samples = sample_gradients(sample_from_tf=False)
sgrads = [ptz.diff(t).subs({t: tvalue, z: zi}).n() for zi in samples]
plt.figure()
plt.scatter(grads, sgrads)
plt.title(r"sampling from scipy.stats $\rightarrow$ EQUAL");
plt.xlabel("symbolic gradients"); plt.ylabel("gradients from TF")

Wrong gradient from tf custom gradient - even though gradient is implemented using the inbuilt Jacobian

I'm trying to write a wrapper around a model, such that the tf model can be called as a function of its weights (and input). However, this wrapper returns different gradients than the gradients from the original model. Details are in the code below (including a colab notebook to reproduce directly), but at the core I'm using the custom gradient decorator - the respective gradient is computed directly as the upstream 'gradient' contracted (via tensordot) with the respective Jacobian.
To make this clear: I'm computing the gradient for a model, once directly, once by using my custom wrapper. In both cases the parameters in the model are the same. The Jacobian is implemented by TF, so nothing should be wrong there. Still the resulting gradient seems to be wrong.
I'm not sure whether this is a coding mistake I made somewhere, or possibly just a numerical problem stemming from the Jacobian contraction - however, my tests regarding the correlation of the gradients suggest this is more than a numerical issue for now. Code of the function is provided below; a link to a colab notebook reproducing the problem can be found here: Colab Notebook reproducing the problem
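As a self-contained sanity check of the identity being relied on here (not part of the original post): contracting the upstream gradient with tape.jacobian over the output axes should reproduce what tape.gradient computes for a scalar downstream loss.

import tensorflow as tf

w = tf.Variable(tf.random.normal([3, 2]))
x = tf.constant([[1.0, 2.0, 3.0]])

with tf.GradientTape(persistent=True) as tape:
    y = tf.matmul(x, w)            # output, shape (1, 2)
    loss = tf.reduce_sum(y ** 2)   # scalar downstream loss

upstream = tape.gradient(loss, y)  # dL/dy, shape (1, 2)
jac = tape.jacobian(y, w)          # dy/dw, shape (1, 2, 3, 2)

# contract over the two output axes of y
vjp = tf.tensordot(upstream, jac, axes=[[0, 1], [0, 1]])
direct = tape.gradient(loss, w)
print(tf.reduce_max(tf.abs(vjp - direct)).numpy())  # ~0 up to float error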
Why: This is important for a bunch of metalearning, which I'm trying to build a small library for currently.
My current 'wrapper' looks something like this:
# calls model on input x but replaces internal weights with the weights argument
# critically supposed to compute the respective gradient for the weights tensor argument!
def call_model_with_weights(model, x, weights, dim_output=2):
    @tf.custom_gradient
    def _call_with_weights(x_and_w):
        x, weights = x_and_w
        # be careful; this assigns weights to the model as a side effect, can ignore for dummy version
        ctrls = [var.assign(val) for var, val in zip(model.trainable_weights, weights)]
        with tf.control_dependencies(ctrls):
            with tf.GradientTape() as tape:
                y = model(x)
            jacobians = tape.jacobian(y, model.trainable_weights)
        def grad(upstream, variables):
            assert len(variables) == len(weights)
            # gradient for each weight should be upstream dot-product with the respective jacobian
            dy_dw = [tf.tensordot(upstream, j, axes=[list(range(dim_output)), list(range(dim_output))])
                     for j in jacobians]
            dy_dw_weights = dy_dw
            return (None, dy_dw_weights), [None for _ in dy_dw]  # returning x as derivative of x is wrong, but not important here rn
        return y, grad
    y = _call_with_weights((x, weights))
    return y
Thanks a lot for any help (including how this could be done in a more elegant way). Helping out means you are contributing to a package that plans to mimic PyTorch 'higher' for TF, which I hope helps some more people <3

tf.keras.layers.BatchNormalization with trainable=False appears to not update its internal moving mean and variance

I am trying to find out how exactly the BatchNormalization layer behaves in TensorFlow. I came up with the following piece of code, which to the best of my knowledge should be a perfectly valid keras model; however, the moving mean and variance of BatchNormalization don't appear to be updated.
From docs https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization
in the case of the BatchNormalization layer, setting trainable = False on the layer means that the layer will be subsequently run in inference mode (meaning that it will use the moving mean and the moving variance to normalize the current batch, rather than using the mean and variance of the current batch).
I expect the model to return a different value with each subsequent predict call.
What I see, however, are the exact same values returned 10 times.
Can anyone explain to me why the BatchNormalization layer does not update its internal values?
import tensorflow as tf
import numpy as np

if __name__ == '__main__':
    np.random.seed(1)
    x = np.random.randn(3, 5) * 5 + 0.3

    bn = tf.keras.layers.BatchNormalization(trainable=False, epsilon=1e-9)

    z = input = tf.keras.layers.Input([5])
    z = bn(z)

    model = tf.keras.Model(inputs=input, outputs=z)

    for i in range(10):
        print(x)
        print(model.predict(x))
        print()
I use TensorFlow 2.1.0
Okay, I found the mistake in my assumptions. The moving averages are updated during training, not during inference as I thought. This makes perfect sense, as updating the moving averages during inference would likely result in an unstable production model (for example, a long sequence of highly pathological input samples [e.g. such that their generating distribution differs drastically from the one on which the network was trained] could potentially bias the network and result in worse performance on valid input samples).
The trainable parameter is useful when you're fine-tuning a pretrained model and want to freeze some of the layers of the network even during training, because when you call model.predict(x) (or even model(x) or model(x, training=False)), the layer automatically uses the moving averages instead of the batch statistics.
The code below demonstrates this clearly
import tensorflow as tf
import numpy as np

if __name__ == '__main__':
    np.random.seed(1)
    x = np.random.randn(10, 5) * 5 + 0.3

    z = input = tf.keras.layers.Input([5])
    z = tf.keras.layers.BatchNormalization(trainable=True, epsilon=1e-9, momentum=0.99)(z)

    model = tf.keras.Model(inputs=input, outputs=z)

    # a dummy loss function
    model.compile(loss=lambda x, y: (x - y) ** 2)

    # a dummy fit just to update the batchnorm moving averages
    model.fit(x, x, batch_size=3, epochs=10)

    # first predict uses the moving averages from training
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()

    # outputs the same thing as previous predict
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()

    # here calling the model with training=True results in update of moving averages
    # furthermore, it uses the batch mean and variance as in training,
    # so the result is very different
    pred = model(x, training=True).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()

    # here we see again that the moving averages are used but they differ slightly after
    # the previous call, as expected
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()
In the end, I found that the documentation (https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization) mentions this:
When performing inference using a model containing batch normalization, it is generally (though not always) desirable to use accumulated statistics rather than mini-batch statistics. This is accomplished by passing training=False when calling the model, or using model.predict.
Hopefully this will help someone with a similar misunderstanding in the future.
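As a side note on the fine-tuning use case mentioned above, here is a minimal sketch (hypothetical, not from the original answer) of freezing just a BatchNormalization layer so that neither its learnable parameters nor its moving statistics change while the rest of the model trains:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu', input_shape=[5]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1),
])

# freeze only the BN layer: with trainable=False it runs in inference mode
# during fit, so its moving mean/variance stay fixed as well
model.layers[1].trainable = False
model.compile(optimizer='adam', loss='mse')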

Weak optimizers in Pytorch

Consider a simple line fitting a * x + b = x, where a, b are the optimized parameters and x is the observed vector given by
import torch
X = torch.randn(1000,1,1)
One can immediately see that the exact solution is a=1, b=0 for any x and it can be found as easily as:
import numpy as np
np.polyfit(X.numpy().flatten(), X.numpy().flatten(), 1)
I am trying now to find this solution by means of gradient descent in PyTorch, where the mean square error is used as an optimization criterion.
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam, SGD, Adagrad, ASGD

X = torch.randn(1000,1,1) # Sample data

class SimpleNet(nn.Module): # Trivial neural network containing two weights
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.f1 = nn.Linear(1,1)

    def forward(self, x):
        x = self.f1(x)
        return x

# Testing default setting of 3 basic optimizers
K = 500

net = SimpleNet()
optimizer = Adam(params=net.parameters())
Adam_losses = []
optimizer.zero_grad() # zero the gradient buffers
for k in range(K):
    for b in range(1): # single batch
        loss = torch.mean((net.forward(X[b,:,:]) - X[b,:, :])**2)
        loss.backward()
        optimizer.step()
        Adam_losses.append(float(loss.detach()))

net = SimpleNet()
optimizer = SGD(params=net.parameters(), lr=0.0001)
SGD_losses = []
optimizer.zero_grad() # zero the gradient buffers
for k in range(K):
    for b in range(1): # single batch
        loss = torch.mean((net.forward(X[b,:,:]) - X[b,:, :])**2)
        loss.backward()
        optimizer.step()
        SGD_losses.append(float(loss.detach()))

net = SimpleNet()
optimizer = Adagrad(params=net.parameters())
Adagrad_losses = []
optimizer.zero_grad() # zero the gradient buffers
for k in range(K):
    for b in range(1): # single batch
        loss = torch.mean((net.forward(X[b,:,:]) - X[b,:, :])**2)
        loss.backward()
        optimizer.step()
        Adagrad_losses.append(float(loss.detach()))
The training progress in terms of loss evolution can be shown as
What is surprising to me is the very slow convergence of the algorithms with their default settings. I thus have 2 questions:
1) Is it possible to achieve an arbitrarily small error (loss) purely by means of some Pytorch optimizer? Since the loss function is convex, it should definitely be possible; however, I am not able to figure out how to achieve this using PyTorch. Note that the above 3 optimizers cannot do that - see the loss progress in log scale for 20000 iterations:
2) I am wondering how the optimizers can work well in complex examples when they do not work well even in this extremely simple example. Or (and that is the second question) is there something wrong in their application above that I missed?
The place where you call zero_grad is wrong. During each epoch, the gradient is added to the previous one and backpropagated. This makes the loss oscillate as it gets closer to the solution, but the previous gradient throws it off the solution again.
The code below will easily perform the task:
import torch

X = torch.randn(1000,1,1)
net = SimpleNet()
optimizer = Adam(params=net.parameters())

for epoch in range(EPOCHS):
    optimizer.zero_grad() # zero the gradient buffers
    loss = torch.mean((net.forward(X) - X) ** 2)
    if loss < 1e-8:
        print(epoch, loss)
        break
    loss.backward()
    optimizer.step()
1) Is it possible to achieve an arbitrarily small error (loss) purely by means of some Pytorch optimizer?
Yes, the precision above is reached in around ~1500 epochs; you can go lower, up to machine (float, in this case) precision.
2) I am wondering how the optimizers can work well in complex examples, when they do not work well even in this extremely simple example.
Currently, we don't have anything better (at least nothing widespread) for network optimization than first-order methods. Those are used because it's much faster to calculate gradients than the Hessians required by higher-order methods. Also, complex, non-convex functions may have a lot of minima which more or less fulfill the task we threw at them; there is no need for a global minimum per se (although one may be reached under some conditions, see this paper).

area=pi*r^2 : what loss/optimizer function to use, and why does the below loss function not predict?

By giving the radius r and the area of a circle, I want the NN to predict correct values. However, the code below does not predict well. What change do I need to make to the loss/optimizer function? It would be great if you could provide some reasoning for choosing the loss/optimizer function.
from tensorflow import keras
import numpy as np
model = keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
model.compile(optimizer='sgd', loss='mean_squared_error')
radiusTrainValues = np.array([1.0,2.0,3.0, 4.0, 5.0, 6.0], dtype=float)
areaTrainValues = np.array([3.14159,12.56637, 28.27433,50.26548,78.53982,113.09734], dtype=float)
model.fit(radiusTrainValues, areaTrainValues, epochs=5000)
radiusTestVlues = np.array([7.0,8.0,9.0, 10.0], dtype=float)
areaTestVlues = np.array([153.93804,201.06193,254.469,314.15927], dtype=float)
print("Input :",radiusTestVlues)
print("CorrectVlues :",areaTestVlues)
print("TF Predicted:",model.predict(radiusTestVlues))
Actually, the problem that you have, I believe, does not come from your loss function. Your loss function is just an indicator of your prediction performance; the optimizer will simply adjust the model parameters in the direction of the gradient of your loss with respect to each parameter (dLoss/dW). What happens here is that you want to use an NN to approximate the function that calculates the area, f(r) = Pi * r^2, using only 1 neuron, which is just f(r) = (W * r) + B. In simple words, you are trying to approximate a parabola (r^2) with a linear function (W * r + B); therefore your loss will decrease until some point and then get stuck, because that is the best it can do. You can try drawing this yourself and you will see there is a huge gap between your line and the parabola.
What you can do is increase the number of layers and neurons; you will see a huge improvement, as in the sketch below.
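A minimal sketch of that suggestion (the layer sizes, activation, and optimizer below are illustrative choices, not from the answer). Note that the test radii 7-10 lie outside the training range 1-6, so extrapolation will still be imperfect even with a larger network:

from tensorflow import keras
import numpy as np

# a small nonlinear network instead of the single linear neuron
model = keras.Sequential([
    keras.layers.Dense(units=16, activation='relu', input_shape=[1]),
    keras.layers.Dense(units=16, activation='relu'),
    keras.layers.Dense(units=1),
])
model.compile(optimizer='adam', loss='mean_squared_error')

radiusTrainValues = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], dtype=float)
areaTrainValues = np.array([3.14159, 12.56637, 28.27433, 50.26548, 78.53982, 113.09734], dtype=float)
model.fit(radiusTrainValues, areaTrainValues, epochs=5000, verbose=0)

radiusTestVlues = np.array([7.0, 8.0, 9.0, 10.0], dtype=float)
print("TF Predicted:", model.predict(radiusTestVlues).flatten())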