What is the proper way to weight decay for Adam Optimizer - tensorflow

Since Adam Optimizer keeps an pair of running averages like mean/variance for the gradients, I wonder how it should properly handle weight decay. I have seen two ways of implementing it.
Only update mean/variance from the gradients based on the objective loss, decay weight explicitly at each mini-batch. (the following code is taken from https://github.com/dmlc/mxnet/blob/v0.7.0/python/mxnet/optimizer.py)
weight[:] -= lr*mean/(sqrt(variance) + self.epsilon)
wd = self._get_wd(index)
if wd > 0.:
weight[:] -= (lr * wd) * weight
Update mean/variance from the gradients based on the objective loss + regularization loss, and update weights like usual. (the following code is taken from https://github.com/dmlc/mxnet/blob/master/src/operator/optimizer_op-inl.h#L210)
grad = scalar<DType>(param.rescale_grad) * grad +
scalar<DType>(param.wd) * weight;
// stuff
Assign(out, req[0],
weight -
scalar<DType>(param.lr) * mean /
(F<square_root>(var) + scalar<DType>(param.epsilon)));
These two approaches sometimes show significant difference in training results. And I actually think the first one makes more sense (and find it gives better results time to time). Caffe and old version of mxnet follow the first approach, while torch, tensorflow and new version of mxnet follow the second one.
Really appreciate your help!

Edit: see also this PR which just got merged into TF.
When using pure SGD (without momentum) as an optimizer, weight decay is the same thing as adding a L2-regularization term to the loss. When using any other optimizer, this is not true.
Weight decay (don't know how to TeX here, so excuse my pseudo-notation):
w[t+1] = w[t] - learning_rate * dw - weight_decay * w
L2-regularization:
loss = actual_loss + lambda * 1/2 sum(||w||_2 for w in network_params)
Computing the gradient of the extra term in L2-regularization gives lambda * w and thus inserting it into the SGD update equation
dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dw
gives the same as weight decay, but mixes lambda with the learning_rate. Any other optimizer, even SGD with momentum, gives a different update rule for weight decay as for L2-regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%" at page 10)
That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of above paper.
One possible way to implement it is by writing an op that does the decay step manually after every optimizer step. A different way, which is what I'm currently doing, is using an additional SGD optimizer just for the weight decay, and "attaching" it to your train_op. Both of these are just crude work-arounds, though. My current code:
# In the network definition:
with arg_scope([layers.conv2d, layers.dense],
weights_regularizer=layers.l2_regularizer(weight_decay)):
# define the network.
loss = # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
with tf.control_dependencies([train_op]):
sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
train_op = sgd.minimize(tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))
This somewhat makes use of TensorFlow's provided bookkeeping. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph-key, which I then all sum up and optimize using SGD which, as shown above, corresponds to actual weight-decay.
Hope that helps, and if anyone gets a nicer code snippet for this, or TensorFlow implements it better (i.e. in the optimizers), please share.

I came across the same question. I think this code that I got from here will work for you. It implements the weight decay adam optimizer by inheritance from the tf.train.Optimizer. This is the cleanest solution I have found:
class AdamWeightDecayOptimizer(tf.train.Optimizer):
"""A basic Adam optimizer that includes "correct" L2 weight decay."""
def __init__(self,
learning_rate,
weight_decay_rate=0.0,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-6,
exclude_from_weight_decay=None,
name="AdamWeightDecayOptimizer"):
"""Constructs a AdamWeightDecayOptimizer."""
super(AdamWeightDecayOptimizer, self).__init__(False, name)
self.learning_rate = learning_rate
self.weight_decay_rate = weight_decay_rate
self.beta_1 = beta_1
self.beta_2 = beta_2
self.epsilon = epsilon
self.exclude_from_weight_decay = exclude_from_weight_decay
def apply_gradients(self, grads_and_vars, global_step=None, name=None):
"""See base class."""
assignments = []
for (grad, param) in grads_and_vars:
if grad is None or param is None:
continue
param_name = self._get_variable_name(param.name)
m = tf.get_variable(
name=param_name + "/adam_m",
shape=param.shape.as_list(),
dtype=tf.float32,
trainable=False,
initializer=tf.zeros_initializer())
v = tf.get_variable(
name=param_name + "/adam_v",
shape=param.shape.as_list(),
dtype=tf.float32,
trainable=False,
initializer=tf.zeros_initializer())
# Standard Adam update.
next_m = (
tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
next_v = (
tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
tf.square(grad)))
update = next_m / (tf.sqrt(next_v) + self.epsilon)
# Just adding the square of the weights to the loss function is *not*
# the correct way of using L2 regularization/weight decay with Adam,
# since that will interact with the m and v parameters in strange ways.
#
# Instead we want ot decay the weights in a manner that doesn't interact
# with the m/v parameters. This is equivalent to adding the square
# of the weights to the loss with plain (non-momentum) SGD.
if self._do_use_weight_decay(param_name):
update += self.weight_decay_rate * param
update_with_lr = self.learning_rate * update
next_param = param - update_with_lr
assignments.extend(
[param.assign(next_param),
m.assign(next_m),
v.assign(next_v)])
return tf.group(*assignments, name=name)
def _do_use_weight_decay(self, param_name):
"""Whether to use L2 weight decay for `param_name`."""
if not self.weight_decay_rate:
return False
if self.exclude_from_weight_decay:
for r in self.exclude_from_weight_decay:
if re.search(r, param_name) is not None:
return False
return True
def _get_variable_name(self, param_name):
"""Get the variable name from the tensor name."""
m = re.match("^(.*):\\d+$", param_name)
if m is not None:
param_name = m.group(1)
return param_name
And you can use it in the following way (I have made some changes to make it useful in a more general context), This function will return a train_op that can be used in the Session:
def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps):
"""Creates an optimizer training op."""
global_step = tf.train.get_or_create_global_step()
learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)
# Implements linear decay of the learning rate.
learning_rate = tf.train.polynomial_decay(
learning_rate,
global_step,
num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False)
# Implements linear warmup. I.e., if global_step < num_warmup_steps, the
# learning rate will be `global_step/num_warmup_steps * init_lr`.
if num_warmup_steps:
global_steps_int = tf.cast(global_step, tf.int32)
warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)
global_steps_float = tf.cast(global_steps_int, tf.float32)
warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)
warmup_percent_done = global_steps_float / warmup_steps_float
warmup_learning_rate = init_lr * warmup_percent_done
is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
learning_rate = (
(1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)
# It is recommended that you use this optimizer for fine tuning, since this
# is how the model was trained (note that the Adam m/v variables are NOT
# loaded from init_checkpoint.)
optimizer = AdamWeightDecayOptimizer(
learning_rate=learning_rate,
weight_decay_rate=0.01,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-6)
tvars = tf.trainable_variables()
grads = tf.gradients(loss, tvars)
# You can do clip gradients if you need in this step(in general it is not neccessary)
# (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
train_op = optimizer.apply_gradients(
zip(grads, tvars), global_step=global_step)
# Normally the global step update is done inside of `apply_gradients`.
# However, `AdamWeightDecayOptimizer` doesn't do this. But if you use
# a different optimizer, you should probably take this line out.
new_global_step = global_step + 1
train_op = tf.group(train_op, [global_step.assign(new_global_step)])
return train_op

Related

PINN : Learning parameters through gradient descent does not lead to appropriate values and decrease quality of learning

I'm trying to implement a Physical Informed Neural Network. The differential part in the loss did bring some improvment (compare to the classical neural net) in the (supposed) unknown area. This unknown area is actually known but I just removed them from training and testing data set to check performance of PINN vs other technics. Here is the code I m using :
model = tf.keras.Sequential([
layers.Dense(units=64, activation='relu', input_shape=(2,)),
layers.Dense(units=64, activation='relu'),
layers.Dense(units=1,)
])
optimizer = tf.keras.optimizers.Adam()
objective = tf.keras.losses.Huber()
metric = tf.keras.metrics.MeanAbsoluteError()
w_phys = 0.5
w_loss = 1.0 - w_phys
with tf.device('gpu:0'):
for epoch in range(epochs):
cumulative_loss_train = 0.0
metric.reset_states()
for mini_batch, gdth in dataset:
with tf.GradientTape(persistent=True) as tape:
tape.watch(unknown_area_SOCP_tensor)
tape.watch(mini_batch)
# Physics loss
predictions_unkwon = model(unknown_area_SOCP_tensor, training=True)
d_f = tape.gradient(predictions_unkwon, unknown_area_SOCP_tensor)
# Physics part with P #
dp = tf.convert_to_tensor(1/((K*unknown_area_SOCP_tensor[:,0]+L)**2-4*R*unknown_area_SOCP_tensor[:,1]), dtype = np.float64)
phys_loss_p = 10*tf.cast(tf.math.reduce_mean(tf.math.square(d_f[:,1]**2 - dp)), np.float32)
# Traditionall loss #
predictions = model(mini_batch, training=True)
loss = objective(gdth, predictions)
# Compute grads #
grads = tape.gradient(w_loss*loss + w_phys*(phys_loss_p), model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
cumulative_loss_train += loss
metric.update_state(gdth, predictions)
del tape
So far so good. K, R and L were fixed parameter.
Next step was to assume they were unknwon and try to figure out if we could learn them.
I tried first by focusing only on R parameter. Here is the code used :
with tf.device('gpu:0'):
for epoch in range(epochs):
cumulative_loss_train = 0.0
metric.reset_states()
for mini_batch, gdth in dataset:
with tf.GradientTape(persistent=True) as tape:
tape.watch(unknown_area_SOCP_tensor)
tape.watch(mini_batch)
tape.watch(R)
# Physics loss
predictions_unkwon = model(unknown_area_SOCP_tensor, training=True)
d_f = tape.gradient(predictions_unkwon, unknown_area_SOCP_tensor)
# Physics part with P #
dp = tf.convert_to_tensor(1/((K*unknown_area_SOCP_tensor[:,0]+L)**2-4*R*unknown_area_SOCP_tensor[:,1]), dtype = np.float64)
phys_loss_p = 10*tf.cast(tf.math.reduce_mean(tf.math.square(d_f[:,1]**2 - dp)), np.float32)
# Traditionall loss #
predictions = model(mini_batch, training=True)
loss = objective(gdth, predictions)
# Compute grads #
grads = tape.gradient(w_loss*loss + w_phys*(phys_loss_p), model.trainable_variables + [R])
optimizer.apply_gradients(zip(grads, model.trainable_variables + [R]))
cumulative_loss_train += loss
metric.update_state(gdth, predictions)
del tape
But that lead to terrible result (like high loss and poor metric). Worse, the value of R has to be positive, and at the end of the training, R was estimated as a negative value...
I'm quite confident on the equation since I have checked a lot of time, and it seems accurate compared to simulation software I'm using. Moreover, the equation brought value to the learning (as predictions on the unknwon were way better).
Did I miss something here ?
Thanks for your help !
I post my answer here in case this may help someone one day.
My issue came from gradient value which was too high. Clipping gradients finally solved my issue.

Problem with Deep Sarsa algorithm which work with pytorch (Adam optimizer) but not with keras/Tensorflow (Adam optimizer)

I have a deep sarsa algorithm which work great on Pytorch on lunar-lander-v2 and I would use with Keras/Tensorflow. It use mini-batch of size 64 which are used 128 time to train at each episode.
There are the results I get. As you can see, it work great with Pytorch but not with Keras / Tensorflow... So I think I do not correctly implement the training function is Keras/Tensorflow (code is below).
It seems that loss is oscillating in Keras because epsilon go to early to slow value but it work very great in Pytorch...
Do you see something that could explain why it do not work in Keras/Tensorflow please? Thanks a lot for your help and any idea that could help me...
Network information:
It use Adam optimizer, and a network with two layers : 256 and 128, with relu on each:
class Q_Network(nn.Module):
def __init__(self, state_dim , action_dim):
super(Q_Network, self).__init__()
self.x_layer = nn.Linear(state_dim, 256)
self.h_layer = nn.Linear(256, 128)
self.y_layer = nn.Linear(128, action_dim)
print(self.x_layer)
def forward(self, state):
xh = F.relu(self.x_layer(state))
hh = F.relu(self.h_layer(xh))
state_action_values = self.y_layer(hh)
return state_action_values
For keras/Tensorflwo I use this one:
def CreationModele(dimension):
entree_etat = keras.layers.Input(shape=(dimension))
sortie = keras.layers.Dense(units=256, activation='relu')(entree_etat)
sortie = keras.layers.Dense(units=128, activation='relu')(sortie)
sortie = keras.layers.Dense(units=4)(sortie)
modele = keras.Model(inputs=entree_etat,outputs=sortie)
return modele
Training code
In Pytorch, the training is done by:
def update_Sarsa_Network(self, state, next_state, action, next_action, reward, ends):
actions_values = torch.gather(self.qnet(state), dim=1, index=action.long())
next_actions_values = torch.gather(self.qnet(next_state), dim=1, index=next_action.long())
next_actions_values = reward + (1.0 - ends) * (self.discount_factor * next_actions_values)
q_network_loss = self.MSELoss_function(actions_values, next_actions_values.detach())
self.qnet_optim.zero_grad()
q_network_loss.backward()
self.qnet_optim.step()
return q_network_loss
And in Keras/Tensorflow by:
mse = keras.losses.MeanSquaredError(
reduction=keras.losses.Reduction.SUM)
#tf.function
def train(model, batch_next_states_tensor, batch_next_actions_tensor, batch_reward_tensor, batch_end_tensor, batch_states_tensor, batch_actions_tensor, optimizer, gamma):
with tf.GradientTape() as tape:
# EStimation des valeurs des actions courantes
actions_values = model(batch_states_tensor) # (mini_batch_size,4)
actions_values = tf.linalg.diag_part(tf.gather(actions_values,batch_actions_tensor,axis=1)) # (mini_batch_size,)
actions_values = tf.expand_dims(actions_values,-1) # (mini_batch_size,1)
# EStimation des valeurs des actions suivantes
next_actions_values = model(batch_next_states_tensor) # (mini_batch_size,4)
next_actions_values = tf.linalg.diag_part(tf.gather(next_actions_values,batch_next_actions_tensor,axis=1)) # (mini_batch_size,)
cibles = batch_reward_tensor + (1.0 - batch_end_tensor)*gamma*tf.expand_dims(next_actions_values,-1) # (mini_batch_size,1)
error = mse(cibles, actions_values)
grads = tape.gradient(error, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
return error
Error function and Optimizer code
The optimizer is Adam in Pytorch and Tensorflow with lr=0.001. In Pytorch:
def __init__(self, state_dim, action_dim):
self.qnet = Q_Network(state_dim, action_dim)
self.qnet_optim = torch.optim.Adam(self.qnet.parameters(), lr=0.001)
self.discount_factor = 0.99
self.MSELoss_function = nn.MSELoss(reduction='sum')
self.replay_buffer = ReplayBuffer()
pass
In Keras / Tensorflow:
alpha = 1e-3
# Initialise le modèle
modele_Keras = CreationModele(8)
optimiseur_Keras = keras.optimizers.Adam(learning_rate=alpha)
Ok I finnaly foud a solution by de-correlate target and action value using two model, one being updated periodically for target values calculation.
I use a model for estimating the epsilon-greedy actions and computing the Q(s,a) values and a fixed model (but periodically uptated with the weight of the previous model) for calculate the targer r+gamma*Q(s',a').
Here is my result :

Tensorflow Neural Machine Translation Example - Loss Function

Im stepping through the code here: https://www.tensorflow.org/tutorials/text/nmt_with_attention
as a learning method and I am confused as to when the loss function is called and what is passed. I added two print statements in the loss_function and when the training loop runs, it only prints out
(64,)
(64, 4935)
at the very start multiple times and then nothing again. I am confused on two fronts:
Why doesnt the loss_function() get called repeatedly through the training loop and print the shapes? I expected that the loss function would get called at the end of each batch which is of size 64.
I expected the shapes of the actuals to be (batch size, time steps) and the predictions to be (batch size, time steps, vocabulary size). It looks like the loss gets called seperately for every time step (64 is the batch size and 4935 is the vocabulary size).
The relevant bits I believe are reproduced below.
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
def loss_function(real, pred):
mask = tf.math.logical_not(tf.math.equal(real, 0))
print(real.shape)
print(pred.shape)
loss_ = loss_object(rea
l, pred)
mask = tf.cast(mask, dtype=loss_.dtype)
loss_ *= mask #set padding entries to zero loss
return tf.reduce_mean(loss_)
#tf.function
def train_step(inp, targ, enc_hidden):
loss = 0
with tf.GradientTape() as tape:
enc_output, enc_hidden = encoder(inp, enc_hidden)
dec_hidden = enc_hidden
dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
# Teacher forcing - feeding the target as the next input
for t in range(1, targ.shape[1]):
# passing enc_output to the decoder
predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
print(targ[:, t])
print(predictions)
loss += loss_function(targ[:, t], predictions)
# using teacher forcing
dec_input = tf.expand_dims(targ[:, t], 1)
batch_loss = (loss / int(targ.shape[1]))
variables = encoder.trainable_variables + decoder.trainable_variables
gradients = tape.gradient(loss, variables)
optimizer.apply_gradients(zip(gradients, variables))
return batch_loss
EPOCHS = 10
for epoch in range(EPOCHS):
start = time.time()
enc_hidden = encoder.initialize_hidden_state()
total_loss = 0
for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
#print(batch)
batch_loss = train_step(inp, targ, enc_hidden)
total_loss += batch_loss
if batch % 100 == 0:
print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
batch,
batch_loss.numpy()))
# saving (checkpoint) the model every 2 epochs
if (epoch + 1) % 2 == 0:
checkpoint.save(file_prefix = checkpoint_prefix)
print('Epoch {} Loss {:.4f}'.format(epoch + 1,
total_loss / steps_per_epoch))
print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
The loss is treated similar to the rest of the graph. In tensorflow calls like tf.keras.layers.Dense and tf.nn.conv2d don't actually do the operation, but instead they define the graph for the operations. I have another post here How do backpropagation works in tensorflow that explains the backprop and some motivation of why this is.
The loss function you have above is
def loss_function(real, pred):
mask = tf.math.logical_not(tf.math.equal(real, 0))
print(real.shape)
print(pred.shape)
loss_ = loss_object(real, pred)
mask = tf.cast(mask, dtype=loss_.dtype)
loss_ *= mask #set padding entries to zero loss
result = tf.reduce_mean(loss_)
return result
Think of this function as a generate that returns result. Result defines the graph to compute the loss. Perhaps a better name for this function would be loss_function_graph_creator ... but that's another story.
Result, which is a graph that contains weights, bias, and information about how to both do the forward propagation and the back propagation is all model.fit needs. It no longer needs this function and it doesn't need to run the function every loop.
Truly, what is happening under the covers is that given your model (called my_model), the compile line
model.compile(loss=loss_function, optimizer='sgd')
is effectively the following lines
input = tf.keras.Input()
output = my_model(input)
loss = loss_function(input,output)
opt = tf.keras.optimizers.SGD()
gradient = opt.minimize(loss)
get_gradient_model = tf.keras.Model(input,gradient)
and there you have the gradient operation which can be use in a loop to get the gradients, which is conceptually what model.fit does.
Q and A
Is the fact that this function: #tf.function def train_step(inp, targ, enc_hidden): has the tf.function decorator (and the loss function is called in it) what makes this code run as you describe and not normal python?
No. It is not 'normal' python. It only defines the flow of tensors through the graph of matrix operations that will (hopefully) run on your GPU. All the tensorflow operations just set up the operations on the GPU (or a simulated GPU if you don't have one).
How can I tell the actual shapes being passed into loss_function (the second part of my question)?
No problem at all... simply run this code
loss_function(y, y).shape
This will compute the loss function of your expected output compared exactly to the same output. The loss will (hopefully) be zero, but actually calculating the value of the loss wasn't the point. You want the shape and this will give it to you.

Tensorflow Eager Execution does not work with Learning Rate decay

trying here to make an eager exec model work with LR decay, but no success. It seems to be a bug, since it appear that the learning rate decay tensor does not get updated. If I am missing something can you land a hand here. Thanks.
The code bellow is learning some word embeddings. However, the learning rate decay section does not work at all.
class Word2Vec(tf.keras.Model):
def __init__(self, vocab_size, embed_size, num_sampled=NUM_SAMPLED):
self.vocab_size = vocab_size
self.num_sampled = num_sampled
self.embed_matrix = tfe.Variable(tf.random_uniform(
[vocab_size, embed_size]), name="embedding_matrix")
self.nce_weight = tfe.Variable(tf.truncated_normal(
[vocab_size, embed_size],
stddev=1.0 / (embed_size ** 0.5)), name="weights")
self.nce_bias = tfe.Variable(tf.zeros([vocab_size]), name="biases")
def compute_loss(self, center_words, target_words):
"""Computes the forward pass of word2vec with the NCE loss."""
embed = tf.nn.embedding_lookup(self.embed_matrix, center_words)
loss = tf.reduce_mean(tf.nn.nce_loss(weights=self.nce_weight,
biases=self.nce_bias,
labels=target_words,
inputs=embed,
num_sampled=self.num_sampled,
num_classes=self.vocab_size))
return loss
def gen():
yield from word2vec_utils.batch_gen(DOWNLOAD_URL, EXPECTED_BYTES,
VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW,
VISUAL_FLD)
def main():
dataset = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32),
(tf.TensorShape([BATCH_SIZE]),
tf.TensorShape([BATCH_SIZE, 1])))
global_step = tf.train.get_or_create_global_step()
starter_learning_rate = 1.0
end_learning_rate = 0.01
decay_steps = 1000
learning_rate = tf.train.polynomial_decay(starter_learning_rate, global_step.numpy(),
decay_steps, end_learning_rate,
power=0.5)
train_writer = tf.contrib.summary.create_file_writer('./checkpoints')
train_writer.set_as_default()
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.95)
model = Word2Vec(vocab_size=VOCAB_SIZE, embed_size=EMBED_SIZE)
grad_fn = tfe.implicit_value_and_gradients(model.compute_loss)
total_loss = 0.0 # for average loss in the last SKIP_STEP steps
checkpoint_dir = "./checkpoints/"
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
root = tfe.Checkpoint(optimizer=optimizer,
model=model,
optimizer_step=tf.train.get_or_create_global_step())
while global_step < NUM_TRAIN_STEPS:
for center_words, target_words in tfe.Iterator(dataset):
with tf.contrib.summary.record_summaries_every_n_global_steps(100):
if global_step >= NUM_TRAIN_STEPS:
break
loss_batch, grads = grad_fn(center_words, target_words)
tf.contrib.summary.scalar('loss', loss_batch)
tf.contrib.summary.scalar('learning_rate', learning_rate)
# print(grads)
# print(len(grads))
total_loss += loss_batch
optimizer.apply_gradients(grads, global_step)
if (global_step.numpy() + 1) % SKIP_STEP == 0:
print('Average loss at step {}: {:5.1f}'.format(
global_step.numpy(), total_loss / SKIP_STEP))
total_loss = 0.0
root.save(file_prefix=checkpoint_prefix)
if __name__ == '__main__':
main()
Note that when eager execution is enabled, the tf.Tensor objects represent concrete values (as opposed to symbolic handles of computation that will occur on Session.run() calls).
As a result, in your code snippet above, the line:
learning_rate = tf.train.polynomial_decay(starter_learning_rate, global_step.numpy(),
decay_steps, end_learning_rate,
power=0.5)
is computing the decayed value once, using the global_step at the time it was invoked, and when the optimizer is being created with:
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.95)
it is being given a fixed learning rate.
To decay the learning rate, you'd want to invoke tf.train.polynomial_decay repeatedly (with updated values for global_step). One way to do this would be to replicate what is done in the RNN example, using something like this:
starter_learning_rate = 1.0
learning_rate = tfe.Variable(starter_learning_rate)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.95)
while global_step < NUM_TRAIN_STEPS:
# ....
learning_rate.assign(tf.train.polynomial_decay(starter_learning_rate, global_step, decay_steps, end_learning_rate, power=0.5))
This way you've captured the learning_rate in a variable that can be updated. Furthermore, it's simple to include the current learning_rate in the checkpoint as well (by including it when creating the Checkpoint object).
Hope that helps.

Tensorflow: can not obtain same result mini-batch SGD optimizer compared to Kaldi nnet1

I am trying to build a Tensorflow example with a simple multl-layer
perceptron (MLP) functionality with one hidden layer. However, when I tested it and compared to other software e.g. Kaldi nnet1, the convergence during the training is not efficient, or cannot be comparable to Kaldi nnet1. I tried my best to make all the parameters the same (input, int target, batch size, learning rate, etc.), however, still confused where could be the reasons. Some pieces of codes are as follows:
Initialization:
self.weight = [tf.Variable(tf.truncated_normal([440, 8192],stddev=0.1))]
self.bias = [tf.Variable(tf.constant(0.01, shape=8192))]
self.weight.append( tf.Variable(tf.truncated_normal([8192, 8],stddev=0.1)) )
self.bias.append( tf.Variable(tf.constant(0.01, shape=8)) )
self.act = [tf.nn.sigmoid( tf.matmul(self.input, self.weight[0]) + self.bias[0] )]
self.nn_out = tf.matmul(self.act, self.weight[1]) + self.bias[1])
self.nn_softmax = tf.nn.softmax(self.nn_out)
self.nn_tgt = tf.placeholder("int64", shape=[None,])
self.cost_mean = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(self.nn_out, self.nn_tgt))
self.train_step = tf.train.GradientDescentOptimizer(self.learn_rate).minimize(self.cost_mean)
# saver
self.saver = tf.train.Saver()
self.sess = tf.Session()
self.sess.run(tf.initialize_all_variables())
Training:
for epoch in xrange(20):
feats_tr, tgts_tr = shuffle(feats_tr, tgts_tr, random_state=777)
# restore the exisiting model
ckpt = tf.train.get_checkpoint_state(ckpt_dir)
if ckpt and ckpt.model_checkpoint_path:
self.load(ckpt.model_checkpoint_path)
# mini-batch
tr_loss = []
for idx_begin in range(0,len(feats_tr), 512):
idx_end = idx_begin + batch_size
batch_feats, batch_tgts = feats_tr[idx_begin:idx_end],tgts_tr[idx_begin:idx_end]
_, loss_val = self.sess.run([self.train_step, self.cost_mean], feed_dict = {self.nn_in: batch_feats,
self.nn_tgt: batch_tgts,self.learn_rate: learn_rate})
tr_loss.append(loss_val)
# cross-validation
cv_loss = []
for idx_begin in range(0,len(feats_cv), 512):
idx_end = idx_begin + batch_size
batch_feats, batch_tgts = feats[idx_begin:idx_end],tgts[idx_begin:idx_end]
loss_all.append(self.sess.run(self.cost_mean,
feed_dict = { self.nn_in: batch_feats,
self.nn_tgt: batch_tgts}))
print( "Avg Loss for Training: "+str(np.mean(tr_loss)) + \
" Avg Loss for Validation: "+str(np.mean(cv_loss)) )
# save model per epoch if np.mean(cv_loss) less than previous
if (epoch+1)%1==0:
if loss_new < loss:
loss = loss_new
print( "Model accepted in epoch %d" %(epoch+1) )
# save model to ckpt_dir with mdl_nam
self.saver.save(self.sess, mdl_nam, global_step=epoch+1)
else:
print( "Model rejected in epoch %d" %(epoch+1) )
and I generated a simple annealing learning rate control as: if the average of cross-validation loss is not improved by a certain threshold, then halving the 'learn_late' with initial 0.008.
I checked all the parameters when compared to Kaldi nnet1, and the only difference now is the initialization parameters of weights and biases. I am not sure whether initialization will affect too much. However, the convergence in terms of 'cv_loss' during training in Tensorflow (Avg. CV Loss 1.99) is not good as Kaldi nnet1 (Avg. CV Loss 0.95). Can someone help to point out where I did something wrong or I missed something?
Many thanks in advance !!!
At each epoch, you call self.load(ckpt.model_checkpoint_path) which seems to load previously saved weights.
Your model cannot learn if it is reset to the initial weights at each epoch.