sampling within GradientTape - tensorflow

I am computing the different terms of the ELBO and its expectations to illustrate and get a better grasp of the reparametrization trick, as nicely explained here under undifferentiable expectations.
As a simplified example in this journey, I have a random variable $z$ distributed as follows
$$z \sim \mathcal{N}(\mu=0, \sigma=t)$$
where $t$ is a parameter that I want to optimize (like MLE). I take one sample $z_i$ from the distribution and I want to compute the gradient of the probability of $z_i$ with respect to $t$, at some given value of $t=t_{\text{value}}$
$$\nabla_t p_z(z_i)$$
I use a gradient tape for this and a tfd.Normal object to compute the pdf. I must build the object within the gradient tape so that I can optimize with respect to $t$, upon which the distribution depends. Therefore I must sample within the gradient tape.
Reparametrization trick aside, when I sample from the same distribution object I get a different gradient than when I sample from an equivalent scipy.stats object (which is therefore not part of the computational graph).
In fact, the gradients computed in the second case (using scipy.stats) correspond to the ones computed with sympy differentiation. And there is definitely a relation between the gradients obtained.
Note that I am just computing the gradients over a single data item each time, not computing expectations.
Clearly, there is some extra dependency introduced in the computational graph by sampling and this affects the gradients. Is this something expected, or not, or am I just doing something too weird?
import sympy as sy
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import tensorflow_probability as tfp
import tensorflow as tf
tfd = tfp.distributions
def sample_gradients(sample_from_tf=True, tvalue=1.5):
    tvar = tf.Variable(tvalue)
    grads, samples = [], []
    # do 100 times to paint the gradients later
    for _ in range(100):
        with tf.GradientTape() as tape:
            dz = tfd.Normal(loc=0, scale=tvar)
            if sample_from_tf:
                sample = dz.sample(1)
            else:
                sample = stats.norm(loc=0, scale=tvar.numpy()).rvs(1)
            loss = dz.prob(sample)
        grad = tape.gradient(loss, tvar)
        grads.append(grad.numpy())
        samples.append(sample[0])
    return grads, samples
# prepare for computing grads of pdf with sympy
t,z = sy.symbols(r't z')
ptz = 1/(t*sy.sqrt(2*sy.pi))*sy.exp(-(z/t)**2/2) # gaussian pdf in sympy
# a chosen value for t
tvalue = 1.5
# compute gradients sampling from the same tfd distribution
# and plot them compared to sympy
grads, samples = sample_gradients(sample_from_tf=True)
sgrads = [ptz.diff(t).subs({t: tvalue, z: zi}).n() for zi in samples]
plt.figure()
plt.scatter(sgrads, grads)
plt.title(r"sampling from the same tfd object $\rightarrow$ DIFFERENT");
plt.xlabel("symbolic gradients"); plt.ylabel("gradients from TF")
# compute gradients sampling from the equivalent distribution in scipy.stats
# and plot them compared to sympy
grads, samples = sample_gradients(sample_from_tf=False)
sgrads = [ptz.diff(t).subs({t: tvalue, z: zi}).n() for zi in samples]
plt.figure()
plt.scatter(sgrads, grads)  # x: symbolic, y: TF, matching the axis labels below
plt.title(r"sampling from scipy.stats $\rightarrow$ EQUAL");
plt.xlabel("symbolic gradients"); plt.ylabel("gradients from TF")
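One possible reading of the discrepancy, sketched here without TensorFlow so it can be checked by hand: if dz.sample draws z as t·ε with ε ~ N(0, 1) (an assumption about how the sample is generated; this is the reparametrized form), then the tape also differentiates along the sample path, i.e. it computes d/dt of p(t·ε; t). That total derivative collapses to −p/t, whereas sympy and the scipy.stats variant hold z fixed. The helper names below are illustrative and not part of any library.

```python
import math

def pdf(z, t):
    """Density of N(0, t) evaluated at z."""
    return math.exp(-0.5 * (z / t) ** 2) / (t * math.sqrt(2.0 * math.pi))

def partial_grad_t(z, t):
    """d/dt of pdf(z, t) with the sample z held fixed (what sympy computes)."""
    return pdf(z, t) * (z ** 2 / t ** 3 - 1.0 / t)

def pathwise_grad_t(eps, t):
    """d/dt of pdf(t * eps, t), i.e. the sample itself moves with t."""
    return -pdf(t * eps, t) / t

def finite_diff(f, t, h=1e-6):
    """Central finite difference, used only as an independent check."""
    return (f(t + h) - f(t - h)) / (2.0 * h)

t_value = 1.5
eps = 0.8            # one fixed draw of the standard normal noise
z = t_value * eps    # the corresponding sample at t = t_value

# Gradient when z is treated as a constant
fd_fixed = finite_diff(lambda t: pdf(z, t), t_value)
# Gradient when z = t * eps is re-derived from t
fd_path = finite_diff(lambda t: pdf(t * eps, t), t_value)

print("fixed sample :", fd_fixed, partial_grad_t(z, t_value))
print("pathwise     :", fd_path, pathwise_grad_t(eps, t_value))
```

If the fixed-sample gradient is what is wanted inside the tape, wrapping the sample in tf.stop_gradient before calling dz.prob should remove the extra path, under the same assumption.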

Related

Implement Kullback-Leibler as loss function for arbitrary distributions

I had already implemented an optimization (gradient descent) algorithm by using the Tensorflow-probability built-in KL-Divergence as a loss function. Theoretically it worked well, but then I found out that the list of registered distributions, which you are able to compare in the KL-Divergence, is quite limited. I tried to minimize the KL-Divergence of a Gaussian Mixture Model (as the true distribution) and a Normal Distribution (optimize Mean and Std, such that KL-Divergence becomes minimal), which was not possible.
So I tried to implement my own approach, which did not work:
import numpy as np
from scipy.stats import norm
import tensorflow as tf
The idea I had was to create densities of the needed distributions via scipy.stats (let's say Normal distributions) and convert the density values to Tensors:
x = np.arange(-10,10,0.001)
mu_train = tf.Variable(2.0)
p_pdf = norm.pdf(x, 0, 1)
q_pdf = norm.pdf(x, mu_train,1)
p = tf.convert_to_tensor(p_pdf)
q = tf.convert_to_tensor(q_pdf, dtype=tf.float64)
Now I defined the KL-Divergence as a function that only depends on q.
def kl_loss(q):
    return tf.reduce_sum(
        tf.where(p == 0, tf.zeros(p.shape, tf.float64), p * tf.math.log(p / q))
    )
Then I calculated the gradient of kl_loss with respect to mu_train, but the output I get from this is "None".
with tf.GradientTape() as tape:
    tape.watch(mu_train)
    loss = kl_loss(q)
d_loss_d_mu = tape.gradient(loss, mu_train)
print(d_loss_d_mu)
Now that I have thought about it, getting "None" as output makes sense to me: kl_loss(q) depends only on the values q(x) generated by the density q, which were computed by scipy.stats before the tape started recording. It does not depend on mu_train directly, since mu_train is just a parameter of the Normal distribution, while the input to kl_loss is an array/tensor of already-evaluated density values.
Does anyone know a workaround for this, or does anyone have a completely different way to use the KL-Divergence as a loss function with arbitrary distributions, such that I can compute gradients with respect to the parameters and run a gradient descent minimizer?
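A framework-free sketch of the structural point raised above: the loss has to be written as a function of the parameter, so that q is re-evaluated from mu every time, instead of being handed a precomputed array. This is plain NumPy with a finite-difference gradient as a stand-in for automatic differentiation; the function names are illustrative, and the sum is multiplied by the grid spacing (unlike the snippet above) so that it approximates the integral. Inside a GradientTape the same shape would presumably be needed, with the density written in TensorFlow ops instead of scipy.stats.

```python
import numpy as np

dx = 0.001
x = np.arange(-10, 10, dx)

def log_normal_pdf(x, mu, sigma=1.0):
    """Log-density of N(mu, sigma); working in log space avoids log(p / q) underflow."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def kl_loss(mu):
    """Riemann-sum approximation of KL(p || q) with p = N(0, 1), q = N(mu, 1)."""
    log_p = log_normal_pdf(x, 0.0)
    log_q = log_normal_pdf(x, mu)   # q is recomputed from mu on every call
    return np.sum(np.exp(log_p) * (log_p - log_q)) * dx

def grad_kl(mu, h=1e-5):
    """Central finite difference in mu (stand-in for autodiff)."""
    return (kl_loss(mu + h) - kl_loss(mu - h)) / (2.0 * h)

mu = 2.0
kl_value = kl_loss(mu)
kl_grad = grad_kl(mu)
# For two unit-variance Gaussians the closed form is mu**2 / 2, with gradient mu.
print(kl_value, kl_grad)
```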

Learning a Categorical Variable with TensorFlow Probability

I would like to use TFP to write a neural network where the output are the probabilities of a categorical variable with 3 classes, and train it using the negative log-likelihood.
As I'm taking my first steps with TF and TFP, I started with a toy model where the input layer has only 1 unit receiving a null input, and the output layer has 3 units with softmax activation function. The idea is that the biases should learn (up to an additive constant) the log of the probabilities.
Here below is my code, true_p are the true parameters I use to generate the data and I would like to learn, while learned_p is what I get from the NN.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from functions import nll
from tensorflow.keras.optimizers import SGD
import tensorflow.keras.layers as layers
import tensorflow_probability as tfp
tfd = tfp.distributions
# params
true_p = np.array([0.1, 0.7, 0.2])
n_train = 1000
# training data
x_train = np.array(np.zeros(n_train)).reshape((n_train,))
y_train = np.array(np.random.choice(len(true_p), size=n_train, p=true_p)).reshape((n_train,))
# model
input_layer = layers.Input(shape=(1,))
p_layer = layers.Dense(len(true_p), activation=tf.nn.softmax)(input_layer)
p_y = tfp.layers.DistributionLambda(tfd.Categorical)(p_layer)
model_p = keras.models.Model(inputs=input_layer, outputs=p_y)
model_p.compile(SGD(), loss=nll)
# training
hist_p = model_p.fit(x=x_train, y=y_train, batch_size=100, epochs=3000, verbose=0)
# check result
learned_p = np.round(model_p.layers[1].call(tf.constant([0], shape=(1, 1))).numpy(), 3)
learned_p
With this setup, I get the result:
>>> learned_p
array([[0.005, 0.989, 0.006]], dtype=float32)
I over-estimate the second category, and can't really distinguish between the first and the third one. What's worse, if I plot the probabilities at the end of each epoch, it looks like they are converging monotonically to the vector [0,1,0], which doesn't make sense (it seems to me the gradient should push in the opposite direction once I start to over-estimate).
I really can't figure out what's going on here, but have the feeling I'm doing something plain wrong. Any idea? Thank you for your help!
For the record, I also tried using other optimizers like Adam or Adagrad playing with the hyper-params, but with no luck.
I'm using Python 3.7.9, TensorFlow 2.3.1 and TensorFlow probability 0.11.1
I believe the default argument to Categorical is not the vector of probabilities, but the vector of logits (values you'd take softmax of to get probabilities). This is to help maintain precision in internal Categorical computations like log_prob. I think you can simply eliminate the softmax activation function and it should work. Please update if it doesn't!
EDIT: alternatively you can replace the tfd.Categorical with
lambda p: tfd.Categorical(probs=p)
but you'll lose the aforementioned precision gains. Just wanted to clarify that passing probs is an option, just not the default.
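A small NumPy sketch of why the training above would drift toward [0, 1, 0], assuming (as this answer suggests) that the first positional argument is read as logits. The layer's softmax output then gets passed through a second softmax inside the distribution. Since the layer output lies on the probability simplex, the largest probability the second softmax can assign to one class is e/(e+2) ≈ 0.576, which is below the target 0.7, so the optimizer keeps pushing toward the corner. The softmax helper here is written by hand for illustration.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax of a 1-D array."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

true_p = np.array([0.1, 0.7, 0.2])

# What the distribution would use if probabilities are fed in as logits:
effective_at_truth = softmax(true_p)

# Best case for class 1 when the input must lie on the simplex:
effective_at_corner = softmax(np.array([0.0, 1.0, 0.0]))

# Feeding real logits (log-probabilities up to a constant) recovers true_p:
recovered = softmax(np.log(true_p))

print(effective_at_truth)
print(effective_at_corner)
print(recovered)
```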

tf.keras.layers.BatchNormalization with trainable=False appears to not update its internal moving mean and variance

I am trying to find out exactly how the BatchNormalization layer behaves in TensorFlow. I came up with the following piece of code, which to the best of my knowledge should be a perfectly valid Keras model; however, the mean and variance of BatchNormalization don't appear to be updated.
From docs https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization
in the case of the BatchNormalization layer, setting trainable = False on the layer means that the layer will be subsequently run in inference mode (meaning that it will use the moving mean and the moving variance to normalize the current batch, rather than using the mean and variance of the current batch).
I expect the model to return a different value with each subsequent predict call.
What I see, however, are the exact same values returned 10 times.
Can anyone explain to me why the BatchNormalization layer does not update its internal values?
import tensorflow as tf
import numpy as np
if __name__ == '__main__':
    np.random.seed(1)
    x = np.random.randn(3, 5) * 5 + 0.3
    bn = tf.keras.layers.BatchNormalization(trainable=False, epsilon=1e-9)
    z = input = tf.keras.layers.Input([5])
    z = bn(z)
    model = tf.keras.Model(inputs=input, outputs=z)
    for i in range(10):
        print(x)
        print(model.predict(x))
        print()
I use TensorFlow 2.1.0
Okay, I found the mistake in my assumptions. The moving average is being updated during training not during inference as I thought. This makes perfect sense, as updating the moving averages during inference would likely result in an unstable production model (for example a long sequence of highly pathological input samples [e.g. such that their generating distribution differs drastically from the one on which the network was trained] could potentially bias the network and result in worse performance on valid input samples).
The trainable parameter is useful when you're fine-tuning a pretrained model and want to freeze some of the layers of the network even during training. It is not needed for inference, because when you call model.predict(x) (or even model(x) or model(x, training=False)), the layer automatically uses the moving averages instead of batch averages.
The code below demonstrates this clearly
import tensorflow as tf
import numpy as np
if __name__ == '__main__':
    np.random.seed(1)
    x = np.random.randn(10, 5) * 5 + 0.3
    z = input = tf.keras.layers.Input([5])
    z = tf.keras.layers.BatchNormalization(trainable=True, epsilon=1e-9, momentum=0.99)(z)
    model = tf.keras.Model(inputs=input, outputs=z)
    # a dummy loss function
    model.compile(loss=lambda x, y: (x - y) ** 2)
    # a dummy fit just to update the batchnorm moving averages
    model.fit(x, x, batch_size=3, epochs=10)
    # first predict uses the moving averages from training
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()
    # outputs the same thing as previous predict
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()
    # here calling the model with training=True results in update of moving averages
    # furthermore, it uses the batch mean and variance as in training,
    # so the result is very different
    pred = model(x, training=True).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()
    # here we see again that the moving averages are used but they differ slightly after
    # the previous call, as expected
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()
In the end, I found that the documentation (https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization) mentions this:
When performing inference using a model containing batch normalization, it is generally (though not always) desirable to use accumulated statistics rather than mini-batch statistics. This is accomplished by passing training=False when calling the model, or using model.predict.
Hopefully this will help someone with similar misunderstanding in the future.
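For anyone who wants to see the mechanism without TensorFlow, here is a minimal NumPy sketch of the behaviour described in this answer. TinyBatchNorm is a made-up stand-in, not the Keras layer: it leaves out the learned scale and offset, and assumes the usual exponential moving-average update, moving = momentum * moving + (1 - momentum) * batch_statistic.

```python
import numpy as np

class TinyBatchNorm:
    """Illustrative stand-in for a batch-norm layer (no learned scale or offset)."""

    def __init__(self, dim, momentum=0.99, epsilon=1e-3):
        self.momentum = momentum
        self.epsilon = epsilon
        self.moving_mean = np.zeros(dim)
        self.moving_var = np.ones(dim)

    def __call__(self, x, training=False):
        if training:
            # Training mode: normalize with batch statistics and update the averages.
            mean, var = x.mean(axis=0), x.var(axis=0)
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1.0 - m) * mean
            self.moving_var = m * self.moving_var + (1.0 - m) * var
        else:
            # Inference mode: use the stored averages and leave them untouched.
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.epsilon)

rng = np.random.default_rng(1)
x = rng.normal(size=(10, 5)) * 5 + 0.3
bn = TinyBatchNorm(dim=5)

first = bn(x)                      # inference: moving averages only
second = bn(x)                     # identical, nothing was updated
mean_before = bn.moving_mean.copy()
trained = bn(x, training=True)     # batch statistics, averages move a little
mean_after = bn.moving_mean.copy()
third = bn(x)                      # inference again, with the shifted averages
```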

Weak optimizers in Pytorch

Consider a simple line fit a * x + b = x, where a and b are the parameters to optimize and x is the observed vector given by
import torch
X = torch.randn(1000,1,1)
One can immediately see that the exact solution is a=1, b=0 for any x and it can be found as easily as:
import numpy as np
np.polyfit(X.numpy().flatten(), X.numpy().flatten(), 1)
I am trying now to find this solution by means of gradient descent in PyTorch, where the mean square error is used as an optimization criterion.
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam, SGD, Adagrad, ASGD
X = torch.randn(1000,1,1) # Sample data
class SimpleNet(nn.Module): # Trivial neural network containing two weights
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.f1 = nn.Linear(1,1)
    def forward(self, x):
        x = self.f1(x)
        return x
# Testing default setting of 3 basic optimizers
K = 500
net = SimpleNet()
optimizer = Adam(params=net.parameters())
Adam_losses = []
optimizer.zero_grad() # zero the gradient buffers
for k in range(K):
    for b in range(1): # single batch
        loss = torch.mean((net.forward(X[b,:,:]) - X[b,:, :])**2)
        loss.backward()
        optimizer.step()
        Adam_losses.append(float(loss.detach()))
net = SimpleNet()
optimizer = SGD(params=net.parameters(), lr=0.0001)
SGD_losses = []
optimizer.zero_grad() # zero the gradient buffers
for k in range(K):
    for b in range(1): # single batch
        loss = torch.mean((net.forward(X[b,:,:]) - X[b,:, :])**2)
        loss.backward()
        optimizer.step()
        SGD_losses.append(float(loss.detach()))
net = SimpleNet()
optimizer = Adagrad(params=net.parameters())
Adagrad_losses = []
optimizer.zero_grad() # zero the gradient buffers
for k in range(K):
    for b in range(1): # single batch
        loss = torch.mean((net.forward(X[b,:,:]) - X[b,:, :])**2)
        loss.backward()
        optimizer.step()
        Adagrad_losses.append(float(loss.detach()))
The training progress in terms of loss evolution can be shown as
What is surprising for me is a very slow convergence of the algorithms in default setting. I have thus 2 questions:
1) Is it possible to achieve an arbitrarily small error (loss) purely by means of some PyTorch optimizer? Since the loss function is convex, it should definitely be possible; however, I am not able to figure out how to achieve this using PyTorch. Note that the above 3 optimizers cannot do that - see the loss progress in log scale for 20000 iterations:
2) I am wondering how the optimizers can work well in complex examples, when they do not work well even in this extremely simple example. Or (and that is the second question) is there something wrong in their application above that I missed?
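The place where you call zero_grad is wrong. Because it is called only once before the loop, the gradient computed in each iteration is added to the gradients accumulated from all previous iterations, and the optimizer steps along that sum. This makes the loss oscillate: as the parameters get close to the solution, the stale accumulated gradient throws them off again.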
Code below will easily perform the task:
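import torch
from torch.optim import Adam

EPOCHS = 5000  # assumed upper bound on iterations; not given in the original answer
X = torch.randn(1000,1,1)
net = SimpleNet()  # the class defined in the question above
optimizer = Adam(params=net.parameters())
for epoch in range(EPOCHS):
    optimizer.zero_grad() # zero the gradient buffers
    loss = torch.mean((net.forward(X) - X) ** 2)
    if loss < 1e-8:
        print(epoch, loss)
        break
    loss.backward()
    optimizer.step()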
1) Is it possible to achieve an arbitrarily small error (loss) purely by
means of some PyTorch optimizer?
Yes, the precision above is reached in around 1500 epochs; you can go lower, down to machine (float in this case) precision.
2) I am wondering how the optimizers can work well in complex
examples, when they do not work well even in this extremely simple
example.
Currently, we don't have anything better (at least nothing widespread) for network optimization than first-order methods. They are used because computing gradients is much faster than computing the Hessians needed by higher-order methods. And complex, non-convex functions may have many minima that fulfill the task we threw at them well enough; there is no need to find a global minimum per se (although the optimizers may reach one under some conditions, see this paper).
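The accumulation effect described at the top of this answer can be reproduced without PyTorch. The sketch below is a deliberately simplified stand-in: plain gradient descent on the scalar loss 0.5 * w**2, with a hand-written grad_buffer playing the role of the parameter's .grad field. The function name, learning rate, and step count are arbitrary choices for illustration.

```python
def run_sgd(zero_grad_each_step, steps=500, lr=0.1):
    """Minimize 0.5 * w**2 with plain SGD, mimicking an accumulating grad buffer."""
    w = 1.0
    grad_buffer = 0.0
    history = []
    for _ in range(steps):
        if zero_grad_each_step:
            grad_buffer = 0.0      # analogue of optimizer.zero_grad()
        grad_buffer += w           # analogue of backward(): d/dw 0.5*w**2 = w is ADDED
        w -= lr * grad_buffer      # analogue of optimizer.step()
        history.append(w)
    return history

with_zeroing = run_sgd(zero_grad_each_step=True)
without_zeroing = run_sgd(zero_grad_each_step=False)

print("zeroed every step :", abs(with_zeroing[-1]))
print("never zeroed      :", max(abs(w) for w in without_zeroing[-50:]))
```

With zeroing, w shrinks by a constant factor every step. Without it, the update obeys w[k+1] = (2 - lr) * w[k] - w[k-1], which for this learning rate oscillates with roughly constant amplitude instead of decaying.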

How do I plot a non-linear model using matplotlib?

I'm a bit lost as to how to proceed to achieve this. Normally with a linear model, when I perform linear regressions, I simply take my training data (x) and my output data (y) and plot them using matplotlib. Now I have 3 features and my output/observation (y). Can anyone guide me as to how to graph this kind of model using matplotlib? My goal is to fit a polynomial model and graph a polynomial using matplotlib.
%matplotlib inline
import sframe as frame
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
# Initialize SFrame
sales = frame.SFrame('kc_house_data.gl/')
# Separate data into test and training data
train_data,test_data = sales.random_split(.8,seed=0)
# Organize data into training and testing data
train_x = train_data[['sqft_living', 'bedrooms', 'bathrooms']].to_dataframe().values
train_y = train_data[['price']].to_dataframe().values
test_x = test_data[['sqft_living', 'bedrooms', 'bathrooms']].to_dataframe().values
test_y = test_data[['price']].to_dataframe().values
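# Create a model using sklearn with multiple features
regr = linear_model.LinearRegression(fit_intercept=True, n_jobs=2)
# fit the model on the training data (predict fails on an unfitted model)
regr.fit(train_x, train_y)
# test predictions
regr.predict(train_x)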
# Prepare to plot the data
Note:
The train_x variable contains my 3 features, and my train_y contains the output data. I use SFrame to contain the data. SFrame has the ability to convert itself into a dataframe (used in Pandas). Using the conversion I am able to grab the values.
Rather than plotting a non-linear model with multiple discrete features at once, I have found that simply observing each and every feature against my observation/output was better and easier for my research.
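A rough sketch of that per-feature approach, in case it helps someone: fit one polynomial per feature with np.polyfit and draw each feature against the output in its own panel. The data here is synthetic and only stands in for the house-price columns; the function names, the polynomial degree, and the output file name are all made up for illustration. matplotlib is imported inside the plotting function so the fitting helper can be used on its own.

```python
import numpy as np

def fit_feature_polynomials(X, y, degree=2):
    """Fit one polynomial of the given degree per feature column against y."""
    return [np.polyfit(X[:, j], y, degree) for j in range(X.shape[1])]

def plot_feature_fits(X, y, coeffs, names, path="feature_fits.png"):
    """Scatter each feature against y and overlay its fitted polynomial."""
    import matplotlib
    matplotlib.use("Agg")  # file output only, no display needed
    import matplotlib.pyplot as plt

    n = X.shape[1]
    fig, axes = plt.subplots(1, n, figsize=(4 * n, 3), squeeze=False)
    for j, ax in enumerate(axes[0]):
        grid = np.linspace(X[:, j].min(), X[:, j].max(), 200)
        ax.scatter(X[:, j], y, s=5, alpha=0.4)
        ax.plot(grid, np.polyval(coeffs[j], grid), color="red")
        ax.set_xlabel(names[j])
        ax.set_ylabel("output")
    fig.tight_layout()
    fig.savefig(path)

# Synthetic stand-in data: 3 features, output depends on each differently.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 1.0 + 2.0 * X[:, 0] ** 2 + 0.5 * X[:, 1] - X[:, 2]

coeffs = fit_feature_polynomials(X, y, degree=2)
```

Calling plot_feature_fits(X, y, coeffs, ["sqft_living", "bedrooms", "bathrooms"]) would then write one panel per feature to the image file. With the real data, X and y would be train_x and train_y[:, 0].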