Approximating the log likelihood of tanh(mean + std*z) - optimization

I have been trying to understand a blog post on soft actor-critic where a neural network representing the policy outputs the mean and standard deviation of a Gaussian distribution over the action for a given state. Since direct back-propagation through a stochastic node is not possible, the reparameterization trick is applied as follows:
normal = Normal(0, 1)
z = normal.sample()
action = torch.tanh(mean + std * z.to(device))
log_prob = Normal(mean, std).log_prob(mean + std * z.to(device)) - torch.log(1 - action.pow(2) + epsilon)
return action, log_prob, z, mean, log_std
I want to know how the log_prob term was derived. Any help would be highly appreciated.
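My understanding so far is that it should follow from the change-of-variables formula for densities, applied to the squashing a = tanh(u) with u = mean + std*z (a sketch of the identity as I understand it, summed over action dimensions; the epsilon in the code is presumably just a small constant for numerical stability):

$$\log \pi(a \mid s) = \sum_i \Big[ \log \mathcal{N}(u_i \mid \mu_i, \sigma_i) - \log\big(1 - \tanh^2(u_i)\big) \Big]$$

but I would appreciate seeing the full derivation spelled out.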


How do I interpret a quickly converging Q loss/value function loss in actor-critic?

I am researching an application of actor-critic RL in a nonstationary environment, and the loss of the Q-network (and, if I also implement a value-function network, that loss too) quickly converges to zero, well before the network finds the optimal policy.
The architecture is somewhat successful at finding a good policy, even though it is not very robust to perturbations, and I suspect that the Q-loss converging this fast reveals an inability to estimate the state-action value or value function correctly. The environment being nonstationary makes it even more suspect, since there should always be some degree of estimation error. Any ideas as to what might be causing this?
Specifically, I am using soft actor-critic, and my implementation is based on OpenAI's Spinning Up repo. The optimization targets are as described in the paper [0], though I honestly find their code much more understandable; the math in RL papers is usually not laid out rigorously enough for me to really make sense of it. Anyway, these are the expressions for the value function target:

$$J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[\tfrac{1}{2}\left(V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi}\left[Q_\theta(s_t, a_t) - \log \pi(a_t \mid s_t)\right]\right)^2\right]$$

and for the Q-function target:

$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[\tfrac{1}{2}\left(Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t)\right)^2\right], \qquad \hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[V_{\bar{\psi}}(s_{t+1})\right]$$

where $\theta$, $\psi$ and $\bar{\psi}$ parameterize the Q-function, the main value network and the target value network, respectively. I slightly modify these equations to optimize for the average reward rate, since my task is continuing (see [3]), and include entropy regularization when taking the log probability of the action given by the policy.
My Q- and value functions are simple MLPs:
# Soft Actor-Critic from OpenAI https://github.com/openai/spinningup/tree/master/spinup/algos/pytorch/sac
def mlp(sizes, activation, output_activation=nn.Identity):
    layers = []
    for j in range(len(sizes) - 1):
        act = activation if j < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j + 1]), act()]
    return nn.Sequential(*layers)

class MLPQFunction(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
        super().__init__()
        self.q = mlp([obs_dim + act_dim] + list(hidden_sizes) + [1], activation)

    def forward(self, obs, act):
        q = self.q(torch.cat([obs, act], dim=-1))
        return torch.squeeze(q, -1)  # Critical to ensure q has right shape.

class MLPValueFunction(nn.Module):
    def __init__(self, obs_dim, hidden_sizes, activation):
        super().__init__()
        self.v = mlp([obs_dim] + list(hidden_sizes) + [1], activation)

    def forward(self, obs):
        v = self.v(obs)
        return v.squeeze()
and I compute the losses as follows, after sampling a tuple of batches (o, a, r, o2) from a replay buffer. Each variable is [batch x dim(S)] if it is an observation, where dim(S) is the dimension of the state space (2 in my case), or [batch x 1] if it is an action or reward.
q1 = net.q1(o, a)
q2 = net.q2(o, a)

# Bellman backup for Q functions
with torch.no_grad():
    # Target actions come from *current* policy
    a2, logp_a2 = net.pi(o2)
    # Target Q-values
    q1_pi_targ = target_net.q1(o2, a2)
    q2_pi_targ = target_net.q2(o2, a2)
    q_pi_targ = torch.min(q1_pi_targ, q2_pi_targ)
    backup = r - avg_reward + (q_pi_targ - temp * logp_a2)

# Smooth L1 loss against Bellman backup
loss_q1 = F.smooth_l1_loss(q1, backup)
loss_q2 = F.smooth_l1_loss(q2, backup)
loss_q = loss_q1 + loss_q2

q_optimizer.zero_grad(set_to_none=True)
loss_q.backward()
q_optimizer.step()

# Compute value function loss
v_optimizer.zero_grad(set_to_none=True)
vf = net.v(o)
with torch.no_grad():
    # q_pi and logp_pi come from the current policy evaluated at o (see the note below)
    vf_target = q_pi - temp * logp_pi
loss_v = F.smooth_l1_loss(vf, vf_target)
loss_v.backward()
v_optimizer.step()
Where avg_reward is estimated as a running average:
avg_reward += AVG_REW_LR * (R - avg_reward + target_net.v(next_state.squeeze()) - target_net.v(state.squeeze()))
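For completeness, q_pi and logp_pi in the value-function target above come from evaluating the current policy at o. Roughly (a sketch following the Spinning Up conventions rather than my exact code):

pi_action, logp_pi = net.pi(o)
with torch.no_grad():
    q_pi = torch.min(net.q1(o, pi_action), net.q2(o, pi_action))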
[0] Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. Proceedings of the 35th International Conference on Machine Learning, 1861–1870. https://proceedings.mlr.press/v80/haarnoja18b.html
[3] Naik, A., Shariff, R., Yasui, N., Yao, H., & Sutton, R. S. (2019). Discounted Reinforcement Learning Is Not an Optimization Problem. ArXiv:1910.02140 [Cs]. http://arxiv.org/abs/1910.02140

Tensorflow Probability VI: Discrete + Continuous RVs inference: gradient estimation?

See this tensorflow-probability issue
tensorflow==2.7.0
tensorflow-probability==0.14.1
TLDR
To perform VI on discrete RVs, should I use:
A- the REINFORCE gradient estimator
B- the Gumbel-Softmax reparameterization
C- another solution
and how should I implement it?
Problem statement
Sorry in advance for the long issue, but I believe the problem requires some explaining.
I want to implement a Hierarchical Bayesian Model involving both continuous and discrete Random Variables. A minimal example is a Gaussian Mixture model:
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
tfb = tfp.bijectors

G = 2

p = tfd.JointDistributionNamed(
    model=dict(
        mu=tfd.Sample(
            tfd.Normal(0., 1.),
            sample_shape=(G,)
        ),
        z=tfd.Categorical(
            probs=tf.ones((G,)) / G
        ),
        x=lambda mu, z: tfd.Normal(
            loc=mu[z],
            scale=1.
        )
    )
)
In this example I don't use the tfd.Mixture API on purpose, in order to expose the Categorical label. I want to perform variational inference in this context and, for instance, given an observed x, fit a Categorical distribution with parametric probabilities over the posterior of z:
q_probs = tfp.util.TransformedVariable(
    tf.ones((G,)) / G,
    tfb.SoftmaxCentered(),
    name="q_probs"
)
q_loc = tf.Variable(0., name="q_loc")
q_scale = tfp.util.TransformedVariable(
    1.,
    tfb.Exp(),
    name="q_scale"
)

q = tfd.JointDistributionNamed(
    model=dict(
        mu=tfd.Normal(q_loc, q_scale),
        z=tfd.Categorical(probs=q_probs)
    )
)
The issue is that when computing the ELBO and trying to optimize for the optimal q_probs, I cannot use the reparameterization gradient estimators; AFAIK this is because z is a discrete RV:
def log_prob_fn(**kwargs):
    return p.log_prob(
        **kwargs,
        x=tf.constant([2.])
    )

optimizer = tf.optimizers.SGD()

@tf.function
def fit_vi():
    return tfp.vi.fit_surrogate_posterior(
        target_log_prob_fn=log_prob_fn,
        surrogate_posterior=q,
        optimizer=optimizer,
        num_steps=10,
        sample_size=8
    )

_ = fit_vi()
# This last line raises:
# ValueError: Distribution `surrogate_posterior` must be reparameterized, i.e., a diffeomorphic transformation
# of a parameterless distribution. (Otherwise this function has a biased gradient.)
I'm looking into a way to make this work. I've identified at least 2 ways to circumvent the issue: using the REINFORCE gradient estimator or the Gumbel-Softmax reparameterization.
A- REINFORCE gradient
Cf. this TFP API link: a classical result in VI is that the REINFORCE gradient can deal with a non-differentiable objective function, for instance one involving discrete RVs.
Could I use a tfp.vi.GradientEstimators.SCORE_FUNCTION estimator instead of the tfp.vi.GradientEstimators.REPARAMETERIZATION one, via the lower-level tfp.vi.monte_carlo_variational_loss function?
Using the REINFORCE gradient, I only need the log_prob method of q to be differentiable; the sample method needn't be differentiated.
As far as I understand it, the sample method of a Categorical distribution implies a gradient break, but the log_prob method does not. Am I correct to assume that this could help with my issue? Am I missing something here?
Also I wonder: why is this possibility not exposed in the tfp.vi.fit_surrogate_posterior API? Is the performance bad, i.e. is the variance of the estimator too large for practical purposes?
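For concreteness, here is what I mean by a hand-written score-function estimator, using a GradientTape directly rather than the TFP loss utilities (a minimal single-sample sketch reusing p, q and log_prob_fn from above; I am not claiming this is what tfp.vi does internally):

def reinforce_loss():
    # Sample from q and cut any reparameterization path through the samples,
    # so that gradients reach q's variables only through log_prob.
    s = tf.nest.map_structure(tf.stop_gradient, q.sample())
    log_q = q.log_prob(s)
    log_p = tf.reduce_sum(log_prob_fn(**s))
    # The gradient of -stop_gradient(log_p - log_q) * log_q is
    # -(log_p - log_q) * grad(log_q), i.e. the REINFORCE (score-function)
    # estimate of the negative ELBO gradient.
    return -tf.stop_gradient(log_p - log_q) * log_q

with tf.GradientTape() as tape:
    loss = reinforce_loss()
grads = tape.gradient(loss, q.trainable_variables)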
B- Gumbel-Softmax reparameterization
Cf. this TFP API link: I could also reparameterize z as a variable y = tfd.RelaxedOneHotCategorical(...). The issue is that I need a proper categorical label to use in the definition of x, so AFAIK I would need to do the following:
p_GS = tfd.JointDistributionNamed(
    model=dict(
        mu=tfd.Sample(
            tfd.Normal(0., 1.),
            sample_shape=(G,)
        ),
        y=tfd.RelaxedOneHotCategorical(
            temperature=1.,
            probs=tf.ones((G,)) / G
        ),
        x=lambda mu, y: tfd.Normal(
            loc=mu[tf.argmax(y)],
            scale=1.
        )
    )
)
...but this would just move the gradient-breaking problem to tf.argmax. This is where I may be missing something. Following the Gumbel-Softmax paper (Jang et al., 2016), I could then use the "straight-through" (ST) strategy and "plug" the gradients of the variable tf.one_hot(tf.argmax(y)) - the "discrete y" - onto y - the "continuous y".
But again I wonder: how do I do this properly? I don't want to mix and match the gradients by hand, and I guess an autodiff backend is precisely meant to spare me this issue. How could I create a distribution that differentiates the forward direction (sampling a "discrete y") from the backward direction (gradient computed using the "continuous y")? I guess this is the intended usage of the tfd.RelaxedOneHotCategorical distribution, but I don't see it implemented anywhere in the API.
Should I implement this myself? How? Could I use something along the lines of tf.custom_gradient?
Actual question
Which solution (A, B, or another) is meant to be used in the TFP API, if any? How should I implement said solution efficiently?
So the idea was not to make this a Q&A, but I looked into the issue for a couple of days and here are my conclusions:
solution A (REINFORCE) is a possibility; it doesn't introduce any bias, but as far as I understand it, it has high variance in its vanilla form, making it prohibitively slow for most real-world tasks. As detailed a bit below, control variates can help tackle the variance issue;
solution B (Gumbel-Softmax) exists in the API as well, but I did not find any native way to make it work for hierarchical tasks. Below is my implementation.
First off, we need to reparameterize the joint distribution p, as the KL divergence between a discrete and a continuous distribution is ill-defined (as explained in the Maddison et al. (2017) paper). To avoid breaking the gradients, I implemented a simple one_hot_straight_through operation that converts the continuous RV y into a discrete RV z:
G = 2

@tf.custom_gradient
def one_hot_straight_through(y):
    depth = y.shape[-1]
    z = tf.one_hot(
        tf.argmax(
            y,
            axis=-1
        ),
        depth=depth
    )

    def grad(upstream):
        # Straight-through: pass the upstream gradient to y unchanged.
        return upstream

    return z, grad
p = tfd.JointDistributionNamed(
    model=dict(
        mu=tfd.Sample(
            tfd.Normal(0., 1.),
            sample_shape=(G,)
        ),
        y=tfd.RelaxedOneHotCategorical(
            temperature=1.,
            probs=tf.ones((G,)) / G
        ),
        x=lambda mu, y: tfd.Normal(
            loc=tf.reduce_sum(
                one_hot_straight_through(y)
                * mu
            ),
            scale=1.
        )
    )
)
The variational distribution q follows the same reparameterization and the following code bit does work:
q_probs = tfp.util.TransformedVariable(
    tf.ones((G,)) / G,
    tfb.SoftmaxCentered(),
    name="q_probs"
)
q_loc = tf.Variable(tf.zeros((2,)), name="q_loc")
q_scale = tfp.util.TransformedVariable(
    1.,
    tfb.Exp(),
    name="q_scale"
)

q = tfd.JointDistributionNamed(
    model=dict(
        mu=tfd.Independent(
            tfd.Normal(q_loc, q_scale),
            reinterpreted_batch_ndims=1
        ),
        y=tfd.RelaxedOneHotCategorical(
            temperature=1.,
            probs=q_probs
        )
    )
)
def log_prob_fn(**kwargs):
    return p.log_prob(
        **kwargs,
        x=tf.constant([2.])
    )

optimizer = tf.optimizers.SGD()

@tf.function
def fit_vi():
    return tfp.vi.fit_surrogate_posterior(
        target_log_prob_fn=log_prob_fn,
        surrogate_posterior=q,
        optimizer=optimizer,
        num_steps=10,
        sample_size=8
    )

_ = fit_vi()
Now there are several issues with that design:
first off, we needed to reparameterize not only q but also p, so we "modify our target model". As a result, our models p and q no longer output discrete RVs as originally intended, but continuous RVs. I think the introduction of a hard option, like in the PyTorch implementation, could be a nice addition to overcome this issue;
second, we introduce the burden of setting the temperature parameter, which controls how closely the continuous RV y approximates its discrete counterpart z. An annealing strategy, which lowers the temperature over training to reduce the bias introduced by the relaxation at the cost of higher variance, can be implemented (see the sketch after this list); or the temperature can be learned online, akin to an entropy regularization (see Maddison et al. (2017) and Jang et al. (2017));
the gradients obtained with this estimator are biased, which is probably acceptable for most applications but is an issue in theory.
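As an illustration of the annealing idea mentioned above, something along these lines could be used (a hypothetical sketch: it assumes the temperature is exposed as a shared tf.Variable used by both p and q, which the code above does not do):

import math

temperature = tf.Variable(1., trainable=False, name="temperature")
t0, t_min, decay_rate = 1.0, 0.1, 1e-3

def anneal(step):
    # exponential decay clipped at a minimum, as suggested in Jang et al. (2017)
    temperature.assign(max(t_min, t0 * math.exp(-decay_rate * step)))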
Recent methods like REBAR (Tucker et al., 2017) or RELAX (Grathwohl et al., 2018) can instead obtain unbiased estimators with lower variance than the original REINFORCE, but they do so at the cost of introducing (learnable) control variates with separate losses. Modifications of the one_hot_straight_through function could probably implement this.
In conclusion, my opinion is that TensorFlow Probability's support for optimizing over discrete RVs is too scarce at the moment, and that the API lacks native functions and tutorials to make it easier for the user.

Image and mask normalization in semantic segmentation for cancer

Could someone please help me: in semantic segmentation tasks, should both the image and the mask be normalized in the batch generator class, or only one of them?
I'm using the following code to normalize image and mask:
mean_val, std_val = img.mean(), img.std()
img = (img - mean_val)/std_val
For example:
here the image and the corresponding masks are normalized for the prostate cancer segmentation task,
while here only the masks are normalized.
Which one is the correct practice?
def __getitem__(self, i):
    index = self.indexes[i * self.batch_size : (i + 1) * self.batch_size]
    X = np.empty((self.batch_size, self.crop_dim[0], self.crop_dim[1], 3)).astype(np.uint8)
    Y = np.empty((self.batch_size, self.crop_dim[0], self.crop_dim[1], 5)).astype(np.uint8)
    for j, ID in enumerate(index):
        dim = (self.crop_dim[0], self.crop_dim[1])
        img = cv2.imread(self.img_list[ID], cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, dim)
        mask = imageio.imread(self.labels[ID], as_gray=False, pilmode="RGB")
        mask = cv2.resize(mask, dim)
        mask = create_labels(mask)
        # Augment training patches only
        if self.augmentation:
            sample = self.augmentation(image=img, mask=mask)
            img, mask = sample['image'], sample['mask']
        mean_val, std_val = img.mean(), img.std()
        img = (img - mean_val) / std_val
        mean_val_mask, std_val_mask = mask.mean(), mask.std()
        mask = (mask - mean_val_mask) / std_val_mask
        X[j,] = img
        Y[j,] = mask
No!
You do not want to normalize the labels: you want to predict them directly, and you do not want the per-pixel target to change based on the global statistics of the mask.
Why normalize at all?
It is common practice to normalize the inputs to a DNN model. This is motivated by the desire to control the "dynamic range" of the activations at the different layers, which in turn helps the optimization process converge more rapidly and stably.
You can find an in-depth analysis of this normalization in this excellent paper:
He, K., Zhang, X., Ren, S. and Sun, J., Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (ICCV 2015).
This rationale does not apply to the labels.
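Concretely, in a batch generator this would look something like the sketch below (assuming, as in the question, that create_labels returns per-pixel class labels; note also that X would then need a float dtype and Y an integer dtype, otherwise the normalized values get truncated):

# normalize the input image only
mean_val, std_val = img.mean(), img.std()
img = (img - mean_val) / std_val

# keep the mask as raw (integer / one-hot) class labels - no normalization
X[j,] = img
Y[j,] = mask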

After quantisation in a neural network, does the output need to be scaled with the inverse of the weight scaling?

I'm currently writing a script to quantise a Keras model down to 8 bits. I'm doing a fairly basic linear scaling on the weights: assuming a normal distribution of weights and biases, I interpolate all the values within 2 standard deviations of the mean to the range [-128, 127].
This all works, and I can run the model through inference, but my output image is really bad. I know there will be a small performance hit, but I'm seeing roughly a 10x performance degradation.
My question is: after this scaling of the weights, do I need to apply the inverse scaling operation to my output? None of the papers I've been reading seem to mention this, but I'm unsure why else my results would be so bad.
The network is for image demosaicing. It takes in a RAW image and is meant to output an image with very low noise and no demosaicing artefacts. My full-precision model is very good, with image PSNRs of around 40-43 dB, but after quantisation I'm getting 4-8 dB and incredibly bad-looking images.
Code for anyone who's bothered to read it
# (assumed initialisation, not shown in the original snippet)
count = 0
max_std = 0.0
mean_of_mean = 0.0

for i in layer_index:
    count = count + 1
    layer = model.get_layer(index=i)
    weights = layer.get_weights()
    weights_act = weights[0]
    bias_act = weights[1]
    std = np.std(weights_act)
    if std > max_std:
        max_std = std
    mean = np.mean(weights_act)
    mean_of_mean = mean_of_mean + mean

mean_of_mean = mean_of_mean / count
max_bound = mean_of_mean + 2 * max_std
min_bound = mean_of_mean - 2 * max_std
print(max_bound, min_bound)

for i in layer_index:
    layer = model.get_layer(index=i)
    weights = layer.get_weights()
    weights_act = weights[0]
    bias_act = weights[1]
    weights_shape = weights_act.shape
    bias_shape = bias_act.shape
    new_weights = np.empty(weights_shape, dtype=np.int8)
    print(new_weights.dtype)
    new_biass = np.empty(bias_shape, dtype=np.int8)
    for a in range(weights_shape[0]):
        for b in range(weights_shape[1]):
            for c in range(weights_shape[2]):
                for d in range(weights_shape[3]):
                    # map [min_bound, max_bound] linearly onto [-128, 127]
                    new_weight = (((weights_act[a, b, c, d] - min_bound) * (127 - (-128)) / (max_bound - min_bound)) + (-128))
                    new_weights[a, b, c, d] = np.int8(new_weight)
                    # print(new_weights[a,b,c,d], weights_act[a,b,c,d])
    for e in range(bias_shape[0]):
        new_bias = (((bias_act[e] - min_bound) * (127 - (-128)) / (max_bound - min_bound)) + (-128))
        new_biass[e] = np.int8(new_bias)
    new_weight_layer = (new_weights, new_biass)
    layer.set_weights(new_weight_layer)
You are not doing what you think you are doing; I'll explain.
If you wish to take a pre-trained model and quantize it, you have to add scales after each operation that involves weights. Let's take the convolution operation as an example.
As we know, the convolution operation is linear. In my explanation I will ignore the bias for the sake of simplicity (adding it back is relatively easy). Let's say X is our input, Y is our output and W is the weights, so convolution can be written as:
Y = W * X
where '*' represents the convolution operation. What you are basically doing is taking the weights, multiplying them by some scalar (let's call it 'a') and shifting them by some other scalar (let's call it 'b'), so in your model you use W', where W' = Wa + b.
If we return to the convolution operation, we see that in your quantized network you are basically doing: Y' = W' * X = (Wa + b) * X
Because convolution is linear, we get: Y' = a(W * X) + b * X
Don't forget that in your network you want to receive Y, not Y', at the output of the convolution; therefore you must shift and rescale to get the correct answer.
So after that explanation (which I hope was clear enough), I hope you can see the problem in your network: you apply this scale and shift to all of the weights and never compensate for it. I think your confusion comes from reading papers about models that were trained in quantized mode from the beginning, rather than about quantizing a pre-trained model.
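To make the compensation concrete, here is a minimal sketch of the usual approach (a hypothetical example, not your exact pipeline): use a symmetric, scale-only mapping so there is no shift term 'b' to worry about, and multiply the layer output back by the scale:

import numpy as np

def quantize_symmetric(w, num_bits=8):
    # scale-only (zero-point-free) quantization: w is approximated by scale * w_q
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = np.max(np.abs(w)) / qmax
    w_q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return w_q, scale

# Conceptually, at inference: y is approximately scale * conv(w_q, x),
# so the convolution output must be multiplied by `scale` to recover
# what the full-precision layer would have produced.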
For your problem, I think the TensorFlow graph transform tool might help; take a look at:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md
If you wish to read more about quantizing a pre-trained model, you can find more information here (for more academic material, go to scholar.google.com):
https://www.tensorflow.org/lite/performance/post_training_quantization

Normalized Mutual Information in Tensorflow

Is it possible to implement normalized mutual information in Tensorflow? I was wondering if I can do that and whether I would be able to differentiate it. Let's say that I have predictions P and labels Y in two different tensors. Is there an easy way to use normalized mutual information?
I want to do something similar to this:
https://course.ccs.neu.edu/cs6140sp15/7_locality_cluster/Assignment-6/NMI.pdf
Assume your clustering method gives probability predictions/membership functions p(c|x); e.g., p(c=1|x) is the probability of x being in the first cluster. Assume y is the ground-truth class label for x.
The normalized mutual information is

$$\mathrm{NMI}(Y; C) = \frac{2\, I(Y; C)}{H(Y) + H(C)}.$$
The entropy H(Y) can be estimated following this thread: https://stats.stackexchange.com/questions/338719/calculating-clusters-entropy-python
By definition, the entropy $H(C)$ is $H(C) = -\sum_c p(c) \log p(c)$, where $p(c) = \int p(c \mid x)\, p(x)\, dx$.
The mutual information is $I(Y; C) = H(C) - H(C \mid Y)$, where $H(C \mid Y) = -\sum_y p(y) \sum_c p(c \mid y) \log p(c \mid y)$, $p(c \mid y) = \int p(c \mid x)\, p(x \mid y)\, dx$, and $p(x \mid y) = p(y \mid x)\, p(x) / p(y)$.
All terms involving an integral can be estimated using sampling, i.e., by averaging over training samples. The overall NMI is then differentiable.
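Concretely, the sampling estimate amounts to, e.g.,

$$p(c) \approx \frac{1}{N}\sum_{i=1}^{N} p(c \mid x_i)$$

over the N training samples $x_i$, which is exactly what the code below does.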
I did not misunderstand your question. I was assuming you use a neural network model that outputs logits, since you did not provide any info; you then need to normalise the logits to get p(c|x).
There may be other ways to estimate NMI, but if you discretize the output of whatever model you use, you cannot differentiate through it.
TensorFlow code
Assume we have label matrix p_y_on_x and cluster predictions p_c_on_x. Each row of them corresponds to an observation x; each column corresponds to the probability of x in each class and cluster (so each row sums up to one). Further assume uniform probability for p(x) and p(x|y).
Then NMI can then be estimated as below:
num_x = tf.cast(tf.shape(p_y_on_x)[0], tf.float32)  # number of observations

p_y = tf.reduce_sum(p_y_on_x, axis=0, keepdims=True) / num_x   # 1-by-num_y
h_y = -tf.reduce_sum(p_y * tf.math.log(p_y))

p_c = tf.reduce_sum(p_c_on_x, axis=0) / num_x                  # num_c
h_c = -tf.reduce_sum(p_c * tf.math.log(p_c))

p_x_on_y = p_y_on_x / num_x / p_y                              # num_x-by-num_y
p_c_on_y = tf.matmul(p_c_on_x, p_x_on_y, transpose_a=True)     # num_c-by-num_y
h_c_on_y = -tf.reduce_sum(tf.reduce_sum(p_c_on_y * tf.math.log(p_c_on_y), axis=0) * p_y)

i_y_c = h_c - h_c_on_y
nmi = 2 * i_y_c / (h_y + h_c)
In practice, please be very careful with the probabilities: they should be strictly positive to avoid numerical issues in tf.math.log.
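For instance, one could clip the probabilities before taking logs (the epsilon value here is arbitrary):

eps = 1e-8
p_y = tf.clip_by_value(p_y, eps, 1.0)
p_c = tf.clip_by_value(p_c, eps, 1.0)
p_c_on_y = tf.clip_by_value(p_c_on_y, eps, 1.0)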
Please comment if you find any mistakes.