TensorFlow eager execution - nested gradient tape

I have been testing my WGAN-GP algorithm on TensorFlow using the traditional graph implementation. Recently I learned about TensorFlow eager execution and tried to convert my code to run in eager execution mode.
Let me first show you the previous code:
self.x_ = self.g_net(self.z)
self.d = self.d_net(self.x, reuse=False)
self.d_ = self.d_net(self.x_)
self.d_loss = tf.reduce_mean(self.d) - tf.reduce_mean(self.d_)
epsilon = tf.random_uniform([], 0.0, 1.0)
x_hat = epsilon * self.x + (1 - epsilon) * self.x_
d_hat = self.d_net(x_hat)
ddx = tf.gradients(d_hat, x_hat)[0]
ddx = tf.sqrt(tf.reduce_sum(tf.square(ddx), axis=1))
ddx = tf.reduce_mean(tf.square(ddx - 1.0) * scale)
self.d_loss = self.d_loss + ddx
self.d_adam = tf.train.AdamOptimizer().minimize(self.d_loss, var_list=self.d_net.vars)
Then I converted it to:
self.x_ = self.g_net(self.z)
epsilon = tf.random_uniform([], 0.0, 1.0)
x_hat = epsilon * self.x + (1 - epsilon) * self.x_
with tf.GradientTape(persistent=True) as temp_tape:
    temp_tape.watch(x_hat)
    d_hat = self.d_net(x_hat)
ddx = temp_tape.gradient(d_hat, x_hat)[0]
ddx = tf.sqrt(tf.reduce_sum(tf.square(ddx), axis=1))
ddx = tf.reduce_mean(tf.square(ddx - 1.0) * 10)
with tf.GradientTape() as d_tape:
    d = self.d_net(x)
    d_ = self.d_net(x_)
    loss_d = tf.reduce_mean(d) - tf.reduce_mean(d_) + ddx
grad_d = d_tape.gradient(loss_d, self.d_net.variables)
self.d_adam.apply_gradients(zip(grad_d, self.d_net.variables))
I tried several alternative ways to implement the WGAN-GP loss, but in every case the d_loss diverges! I hope someone can enlighten me by pointing out my mistake(s).
Furthermore, I wonder whether I could use Keras layers with my previous loss and optimizer implementation. Thank you in advance!
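For reference, here is a minimal sketch of how the gradient penalty is often computed with nested tf.GradientTape objects in eager mode; it is not necessarily a drop-in fix for the code above. It assumes TF 2.x-style APIs and stand-in names (d_net, g_net, x, z, d_optimizer). Note that tape.gradient returns a tensor rather than a list, and that a tape only records operations executed while it is active, so here the penalty is built inside the outer tape's context:
import tensorflow as tf

# Sketch only: d_net, g_net, x, z and d_optimizer are assumed to exist.
with tf.GradientTape() as d_tape:
    x_fake = g_net(z)
    epsilon = tf.random.uniform([], 0.0, 1.0)
    x_hat = epsilon * x + (1 - epsilon) * x_fake

    with tf.GradientTape() as gp_tape:
        gp_tape.watch(x_hat)
        d_hat = d_net(x_hat)

    # Gradient of the critic output w.r.t. the interpolated input.
    ddx = gp_tape.gradient(d_hat, x_hat)          # a tensor, not a list: no [0]
    slopes = tf.sqrt(tf.reduce_sum(tf.square(ddx), axis=1))
    gradient_penalty = tf.reduce_mean(tf.square(slopes - 1.0)) * 10.0

    # Same sign convention as the question's code.
    d_loss = tf.reduce_mean(d_net(x)) - tf.reduce_mean(d_net(x_fake)) + gradient_penalty

grads = d_tape.gradient(d_loss, d_net.trainable_variables)
d_optimizer.apply_gradients(zip(grads, d_net.trainable_variables))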

Related

How to implement a moving max (and min) calculation in a custom tf2.keras layer

During the training procedure, I want to calculate the moving maximum (and minimum) values of a batch of feature maps, and then I will implement a quantization algorithm based on the moving max (or min) values. For example: moving_max = (1 - momentum) x (previous moving_max) + momentum x (current max value of a batch).
I implemented the following code as a custom tf2.keras layer:
import tensorflow as tf
from tensorflow.keras.layers import Layer

class QATQuantizerLayer(Layer):
    def __init__(self, num_bits, momentum=0.01, **kwargs):
        super(QATQuantizerLayer, self).__init__(**kwargs)
        self.num_bits = num_bits
        self.momentum = momentum
        self.num_flag = 0
        self.quant_min_val = 0
        self.quant_max_val = (1 << self.num_bits) - 1
        self.quant_range = float(self.quant_max_val - self.quant_min_val)

    def build(self, input_shape):
        self.moving_min = self.add_weight("moving_min", shape=(1,), initializer=tf.constant_initializer(-6), trainable=False)
        self.moving_max = self.add_weight("moving_max", shape=(1,), initializer=tf.constant_initializer(6), trainable=False)
        return super(QATQuantizerLayer, self).build(input_shape)

    def call(self, inputs, training, **kwargs):
        if training is None:
            training = False
        if training == True:
            batch_min = tf.reduce_min(inputs)
            batch_max = tf.reduce_max(inputs)
            if self.num_flag == 0:
                self.num_flag += 1
                self.moving_min = batch_min
                self.moving_max = batch_max
            else:
                temp_min = (1 - self.momentum) * self.moving_min + self.momentum * batch_min
                temp_max = (1 - self.momentum) * self.moving_max + self.momentum * batch_max
                self.moving_min = temp_min
                self.moving_max = temp_max
        float_range = self.moving_max - self.moving_min
        scale = float_range / self.quant_range
        scale = tf.maximum(scale, tf.keras.backend.epsilon())
        zero_point = tf.math.round(self.moving_min / scale)
        # _round_imp is a rounding helper defined elsewhere (not shown in the question).
        output = (tf.clip_by_value(_round_imp(inputs / scale) - zero_point,
                                   self.quant_min_val, self.quant_max_val) + zero_point) * scale
        return output
However, when I start training I get the following error:
TypeError: An op outside of the function building code is being passed a "Graph" tensor. It is possible to have Graph tensors leak out of the function building context by including a tf.init_scope in your function building code. For example, the following function will fail:......
If I change the statement [temp_min = (1 - self.momentum) * self.moving_min + self.momentum * batch_min] to [temp_min = (1 - self.momentum) + self.momentum * batch_min], the error disappears. (That is, I remove self.moving_min from the statement.)
How can I solve this problem?
Thank you very much.
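For comparison, here is a minimal sketch of one common way to keep moving statistics in a Keras layer: leave moving_min/moving_max as the non-trainable variables created in build and update them in place with assign, instead of rebinding the Python attributes to freshly computed tensors. The first-batch num_flag handling is omitted, and tf.math.round stands in for the _round_imp helper, which is not shown in the question:
# Sketch only: a simplified call() for the QATQuantizerLayer above.
def call(self, inputs, training=None, **kwargs):
    if training:
        batch_min = tf.reduce_min(inputs)
        batch_max = tf.reduce_max(inputs)
        new_min = (1 - self.momentum) * self.moving_min + self.momentum * batch_min
        new_max = (1 - self.momentum) * self.moving_max + self.momentum * batch_max
        # Update the variables in place; self.moving_min stays a tf.Variable.
        self.moving_min.assign(tf.reshape(new_min, (1,)))
        self.moving_max.assign(tf.reshape(new_max, (1,)))

    float_range = self.moving_max - self.moving_min
    scale = tf.maximum(float_range / self.quant_range, tf.keras.backend.epsilon())
    zero_point = tf.math.round(self.moving_min / scale)
    return (tf.clip_by_value(tf.math.round(inputs / scale) - zero_point,
                             self.quant_min_val, self.quant_max_val) + zero_point) * scale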

I want to train a set of weights using PyTorch, but the weights do not even change

I want to reproduce a method from a paper. The code in the paper was written in TensorFlow 1.0, and I want to rewrite it in PyTorch. Briefly, I want to learn a set of weights G that can be used to reweight the input data, but during training G does not change at all. This is the TensorFlow code:
n, p = X_input.shape
n_e, p_e = X_encoder_input.shape
display_step = 100
X = tf.placeholder("float", [None, p])
X_encoder = tf.placeholder("float", [None, p_e])
G = tf.Variable(tf.ones([n, 1]))
loss_balancing = tf.constant(0, tf.float32)
for j in range(1, p + 1):
    X_j = tf.slice(X_encoder, [j * n, 0], [n, p_e])
    I = tf.slice(X, [0, j - 1], [n, 1])
    balancing_j = tf.divide(tf.matmul(tf.transpose(X_j), G * G * I), tf.maximum(tf.reduce_sum(G * G * I), tf.constant(0.1))) - tf.divide(tf.matmul(tf.transpose(X_j), G * G * (1 - I)), tf.maximum(tf.reduce_sum(G * G * (1 - I)), tf.constant(0.1)))
    loss_balancing += tf.norm(balancing_j, ord=2)
loss_regulizer = (tf.reduce_sum(G * G) - n) ** 2 + 10 * (tf.reduce_sum(G * G - 1)) ** 2
loss = loss_balancing + 0.0001 * loss_regulizer
optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(loss)
saver = tf.train.Saver()
sess = tf.Session()
sess.run(tf.global_variables_initializer())
And this is my PyTorch rewrite:
n, p = x_test.shape
loss_balancing = torch.tensor(0.0)
G = nn.Parameter(torch.ones([n, 1]))
optimizer = torch.optim.RMSprop([G], lr=0.001)
for i in range(num_steps):
    for j in range(1, p + 1):
        x_j = x_all_encoder[j * n : j * n + n, :]
        I = x_test[0:n, j - 1:j]
        balancing_j = torch.divide(torch.matmul(torch.transpose(x_j, 0, 1), G * G * I),
                                   torch.maximum((G * G * I).sum(),
                                                 torch.tensor(0.1) -
                                                 torch.divide(torch.matmul(torch.transpose(x_j, 0, 1), G * G * (1 - I)),
                                                              torch.maximum((G * G * (1 - I)).sum(), torch.tensor(0.1))
                                                              )
                                                 )
                                   )
        loss_balancing += nn.Parameter(torch.norm(balancing_j))
    loss_regulizer = nn.Parameter(((G * G) - n).sum() ** 2 + 10 * ((G * G - 1).sum()) ** 2)
    loss = nn.Parameter(loss_balancing + 0.0001 * loss_regulizer)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if i % 100 == 0:
        print('Loss:{:.4f}'.format(loss.item()))
But G.grad is None. I want to know how to make G take on values, through iteration, that minimize the loss. Thanks.
Firstly, please provide a minimal reproducible example. It will be very helpful for people answering your question.
Since G.grad has no value, it indicates that loss.backward() didn't work properly.
The computation of the gradient can be disturbed by many factors, but in this case I suspect that the maximum operation in your code blocks the backward flow, since the maximum operation is not differentiable in general.
To check whether this hypothesis is correct, you could inspect the gradient of a tensor created after the maximum operation, which I can't do myself because the provided code is not executable on my side.
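For instance, here is a minimal sketch of how an intermediate tensor's gradient can be inspected with retain_grad(); the names are placeholders, not the question's actual data:
import torch

G = torch.nn.Parameter(torch.ones(4, 1))
x = torch.ones(4, 1)

# Intermediate tensor right after the maximum operation.
clipped = torch.maximum((G * G * x).sum(), torch.tensor(0.1))
clipped.retain_grad()              # keep .grad for this non-leaf tensor

loss = clipped ** 2
loss.backward()

print(clipped.grad)                # gradient reaching the intermediate tensor
print(G.grad)                      # None here would mean the graph is broken upstream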

Adam custom implementation by PyTorch

I'm trying to code my own implementation of the Adam optimization algorithm, but when I try to find the optimum of the function f(x, y) = x*x + y*y, the method generates unexpected output.
Here is the code, along with a graph of each point on Adam's path and on the path of a simpler algorithm, SGD.
import torch

class optimizer:
    def __init__(self, params):
        self.parameters = list(params)

    def zero_grad(self):
        for param in self.parameters:  # Must be an iterable.
            try:
                param.grad.zero_()
            except:
                pass

    def step(self):
        pass

class Adam(optimizer):
    def __init__(self, params, lr, beta1=0.9, beta2=0.999):
        self.parameters = list(params)
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.EMA1 = [torch.zeros_like(param) for param in self.parameters]
        self.EMA2 = [torch.zeros_like(param) for param in self.parameters]
        self.iter_num = 0
        self.eps = 1e-9

    def step(self):
        self.iter_num += 1
        correct1 = 1 - self.beta1**self.iter_num  # EMA1 bias correction.
        correct2 = 1 - self.beta2**self.iter_num  # EMA2 bias correction.
        with torch.no_grad():
            for param, EMA1, EMA2 in zip(self.parameters, self.EMA1, self.EMA2):
                EMA1.set_((1 - self.beta1) * param.grad + self.beta1 * EMA1)
                EMA2.set_((1 - self.beta2) * (param.grad**2) + self.beta2 * EMA2)
                numerator = EMA1 / correct1
                denominator = (EMA2 / correct2).sqrt() + self.eps
                param -= self.lr * numerator / denominator
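As an aside, here is a minimal usage sketch for driving such an optimizer on f(x, y) = x*x + y*y; the starting point, learning rate, and step count are arbitrary placeholders:
# Sketch only: assumes the Adam class defined above.
xy = torch.tensor([3.0, -4.0], requires_grad=True)
opt = Adam([xy], lr=0.1)

path = []
for _ in range(200):
    opt.zero_grad()
    loss = (xy * xy).sum()         # f(x, y) = x*x + y*y
    loss.backward()
    opt.step()
    path.append(xy.detach().clone())

print(path[-1])                    # with a correct Adam step this approaches (0, 0)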

How does Tensorflow Batch Normalization work?

I'm using tensorflow batch normalization in my deep neural network successfully. I'm doing it the following way:
if apply_bn:
    with tf.variable_scope('bn'):
        beta = tf.Variable(tf.constant(0.0, shape=[out_size]), name='beta', trainable=True)
        gamma = tf.Variable(tf.constant(1.0, shape=[out_size]), name='gamma', trainable=True)
        batch_mean, batch_var = tf.nn.moments(z, [0], name='moments')
        ema = tf.train.ExponentialMovingAverage(decay=0.5)

        def mean_var_with_update():
            ema_apply_op = ema.apply([batch_mean, batch_var])
            with tf.control_dependencies([ema_apply_op]):
                return tf.identity(batch_mean), tf.identity(batch_var)

        mean, var = tf.cond(self.phase_train,
                            mean_var_with_update,
                            lambda: (ema.average(batch_mean), ema.average(batch_var)))

        self.z_prebn.append(z)
        z = tf.nn.batch_normalization(z, mean, var, beta, gamma, 1e-3)
        self.z.append(z)
        self.bn.append((mean, var, beta, gamma))
And it works fine for both the training and testing phases.
However, I run into problems when I try to use the computed network parameters in another project, where I need to compute all the matrix multiplications and other operations myself. The problem is that I can't reproduce the behavior of the tf.nn.batch_normalization function:
feed_dict = {
    self.tf_x: np.array([range(self.x_cnt)]) / 100,
    self.keep_prob: 1,
    self.phase_train: False
}
for i in range(len(self.z)):
    # print each array's value at index [0][1]
    print(self.sess.run([
        self.z_prebn[i][0][1],  # before bn
        self.bn[i][0][1],       # mean
        self.bn[i][1][1],       # var
        self.bn[i][2][1],       # offset
        self.bn[i][3][1],       # scale
        self.z[i][0][1],        # after bn
    ], feed_dict=feed_dict))
# prints
# [-0.077417567, -0.089603029, 0.000436493, -0.016652612, 1.0055743, 0.30664611]
According to the formula on the page https://www.tensorflow.org/versions/r1.2/api_docs/python/tf/nn/batch_normalization:
bn = scale * (x - mean) / (sqrt(var) + 1e-3) + offset
But as we can see,
1.0055743 * (-0.077417567 - -0.089603029)/(0.000436493^0.5 + 1e-3) + -0.016652612
= 0.543057
Which differs from the value 0.30664611, computed by Tensorflow itself.
So what am I doing wrong here, and why can't I just calculate the batch-normalized value myself?
Thanks in advance!
The formula used is slightly different from:
bn = scale * (x - mean) / (sqrt(var) + 1e-3) + offset
It should be:
bn = scale * (x - mean) / (sqrt(var + 1e-3)) + offset
The variance_epsilon variable is supposed to scale with the variance, not with sigma, which is the square-root of variance.
After the correction, the formula yields the correct value:
1.0055743 * (-0.077417567 - -0.089603029)/((0.000436493 + 1e-3)**0.5) + -0.016652612
# 0.30664642276945747
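A quick way to check both variants against the numbers printed in the question:
import numpy as np

x, mean, var = -0.077417567, -0.089603029, 0.000436493
offset, scale, eps = -0.016652612, 1.0055743, 1e-3

wrong = scale * (x - mean) / (np.sqrt(var) + eps) + offset   # epsilon added to sigma
right = scale * (x - mean) / np.sqrt(var + eps) + offset     # epsilon added to the variance

print(wrong)  # ~0.5431, does not match TensorFlow
print(right)  # ~0.3066, matches tf.nn.batch_normalization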

Avoiding optimization pitfalls when modeling an ordinal predicted variable in PyMC3

I am trying to model an ordinal predicted variable using PyMC3 based on the approach in chapter 23 of Doing Bayesian Data Analysis. I would like to determine a good starting value using find_MAP, but am receiving an optimization error.
The model:
import pymc3 as pm
import numpy as np
import theano
import theano.tensor as tt

# Some helper functions
def cdf(x, location=0, scale=1):
    epsilon = np.array(1e-32, dtype=theano.config.floatX)
    location = tt.cast(location, theano.config.floatX)
    scale = tt.cast(scale, theano.config.floatX)
    div = tt.sqrt(2 * scale ** 2 + epsilon)
    div = tt.cast(div, theano.config.floatX)
    erf_arg = (x - location) / div
    return .5 * (1 + tt.erf(erf_arg + epsilon))

def percent_to_thresh(idx, vect):
    return 5 * tt.sum(vect[:idx + 1]) + 1.5

def full_thresh(thresh):
    idxs = tt.arange(thresh.shape[0] - 1)
    thresh_mod, updates = theano.scan(fn=percent_to_thresh,
                                      sequences=[idxs],
                                      non_sequences=[thresh])
    return tt.concatenate([[-1 * np.inf, 1.5], thresh_mod, [6.5, np.inf]])

def compute_ps(thresh, location, scale):
    f_thresh = full_thresh(thresh)
    return cdf(f_thresh[1:], location, scale) - cdf(f_thresh[:-1], location, scale)

# Generate data
real_ps = [0.05, 0.05, 0.1, 0.1, 0.2, 0.3, 0.2]
data = np.random.choice(7, size=1000, p=real_ps)

# Run model
with pm.Model() as model:
    mu = pm.Normal('mu', mu=4, sd=3)
    sigma = pm.Uniform('sigma', lower=0.1, upper=70)
    thresh = pm.Dirichlet('thresh', a=np.ones(5))
    cat_p = compute_ps(thresh, mu, sigma)
    results = pm.Categorical('results', p=cat_p, observed=data)

with model:
    start = pm.find_MAP()
    trace = pm.sample(2000, start=start)
When running this, I receive the following error:
Applied interval-transform to sigma and added transformed sigma_interval_ to model.
Applied stickbreaking-transform to thresh and added transformed thresh_stickbreaking_ to model.
Traceback (most recent call last):
File "cm_net_log.v1-for_so.py", line 53, in <module>
start = pm.find_MAP()
File "/usr/local/lib/python3.5/site-packages/pymc3/tuning/starting.py", line 133, in find_MAP
specific_errors)
ValueError: Optimization error: max, logp or dlogp at max have non-finite values. Some values may be outside of distribution support. max: {'thresh_stickbreaking_': array([-1.04298465, -0.48661088, -0.84326554, -0.44833646]), 'sigma_interval_': array(-2.220446049250313e-16), 'mu': array(7.68422528308479)} logp: array(-3506.530143064723) dlogp: array([ 1.61013190e-06, nan, -6.73994118e-06,
-6.93873894e-06, 6.03358122e-06, 3.18954680e-06])Check that 1) you don't have hierarchical parameters, these will lead to points with infinite density. 2) your distribution logp's are properly specified. Specific issues:
My questions:
How can I determine why dlogp is nan at certain points?
Is there a different way that I can express this model to avoid dlogp being nan?
Also worth noting:
This model runs fine if I skip find_MAP and use a Metropolis sampler. However, I'd like to have the flexibility of using other samplers as this model becomes more complex.
I have a suspicion that the issue is due to the relationship between the thresholds and the normal distribution, but I don't know how to disentangle them for the optimization.
Regarding question 2: I expressed the model for the ordinal predicted variable (single group) differently; I used the Theano @as_op decorator for a function that calculates the probabilities of the outcomes. That also explains why I cannot use find_MAP() or gradient-based samplers: Theano cannot calculate a gradient for the custom function. (http://pymc-devs.github.io/pymc3/notebooks/getting_started.html#Arbitrary-deterministics)
from scipy.stats import norm
from theano.compile.ops import as_op

# Number of outcomes
nYlevels = df.Y.cat.categories.size

thresh = [k + .5 for k in range(1, nYlevels)]
thresh_obs = np.ma.asarray(thresh)
thresh_obs[1:-1] = np.ma.masked

@as_op(itypes=[tt.dvector, tt.dscalar, tt.dscalar], otypes=[tt.dvector])
def outcome_probabilities(theta, mu, sigma):
    out = np.empty(nYlevels)
    n = norm(loc=mu, scale=sigma)
    out[0] = n.cdf(theta[0])
    out[1] = np.max([0, n.cdf(theta[1]) - n.cdf(theta[0])])
    out[2] = np.max([0, n.cdf(theta[2]) - n.cdf(theta[1])])
    out[3] = np.max([0, n.cdf(theta[3]) - n.cdf(theta[2])])
    out[4] = np.max([0, n.cdf(theta[4]) - n.cdf(theta[3])])
    out[5] = np.max([0, n.cdf(theta[5]) - n.cdf(theta[4])])
    out[6] = 1 - n.cdf(theta[5])
    return out

with pm.Model() as ordinal_model_single:
    theta = pm.Normal('theta', mu=thresh, tau=np.repeat(.5**2, len(thresh)),
                      shape=len(thresh), observed=thresh_obs, testval=thresh[1:-1])
    mu = pm.Normal('mu', mu=nYlevels/2.0, tau=1.0/(nYlevels**2))
    sigma = pm.Uniform('sigma', nYlevels/1000.0, nYlevels*10.0)
    pr = outcome_probabilities(theta, mu, sigma)
    y = pm.Categorical('y', pr, observed=df.Y.cat.codes.as_matrix())
http://nbviewer.jupyter.org/github/JWarmenhoven/DBDA-python/blob/master/Notebooks/Chapter%2023.ipynb