I wrote a custom loss function that add the regularization loss to the total loss, I added L2 regularizer to kernels only, but when I called model.fit() a warning appeared which states that the gradients does not exist for those biases, and biases are not updated, also if I remove a regularizer from a kernel of one of the layers, the gradient for that kernel also does not exist.
I tried to add bias regularizer to each layer and everything worked correctly, but I don't want to regularize the biases, so what should I do?
Here is my loss function:
def _loss_function(y_true, y_pred):
# convert tensors to numpy arrays
y_true_n = y_true.numpy()
y_pred_n = y_pred.numpy()
# modify probablities for Knowledge Distillation loss
# we do this for old tasks only
old_y_true = np.float_power(y_true_n[:, :-1], 0.5)
old_y_true = old_y_true / np.sum(old_y_true)
old_y_pred = np.float_power(y_pred_n[:, :-1], 0.5)
old_y_pred = old_y_pred / np.sum(old_y_pred)
# Define the loss that we will used for new and old tasks
bce = tf.keras.losses.BinaryCrossentropy()
# compute the loss on old tasks
old_loss = bce(old_y_true, old_y_pred)
# compute the loss on new task
new_loss = bce(y_true_n[:, -1], y_pred_n[:, -1])
# compute the regularization loss
reg_loss = tf.compat.v1.losses.get_regularization_loss()
assert reg_loss is not None
# convert all tensors to float64
old_loss = tf.cast(old_loss, dtype=tf.float64)
new_loss = tf.cast(new_loss, dtype=tf.float64)
reg_loss = tf.cast(reg_loss, dtype=tf.float64)
return old_loss + new_loss + reg_loss
In keras, loss function should return the loss value without regularization losses. The regularization losses will be added automatically by setting kernel_regularizer or bias_regularizer in each of the keras layers.
In other words, when you write your custom loss function, you don't have to care about regularization losses.
Edit: the reason why you got the warning messages that gradients don't exist is because of the usage of numpy() in your loss function. numpy() will stop any gradient propagation.
The warning messages disappeared after you added regularizers to the layers do not imply that the gradients were then computed correctly. It would only include the gradients from the regularizers but not from the data. numpy() should be removed in the loss function in order to get the correct gradients.
One of the solutions is to keep everything in tensors and use tf.math library. e.g. use tf.pow to replace np.float_power and tf.reduce_sum to replace np.sum
Related
I have attempted to translate pytorch implementation of a NN model which calculates forces and energies in molecular structures to TensorFlow. This needed a custom training loop and custom loss function so I implemented to different one step training functions below.
First using Nested Gradient Tapes.
def calc_gradients(D_train_batch, E_train_batch, F_train_batch, opt):
#set up gradient tape scope in order to track gradients of both d(Loss)/d(Weights)
#and d(output)/d(input)
with tf.GradientTape() as tape1:
with tf.GradientTape() as tape2:
#set gradient tape to watch Tensor
tape2.watch(D_train_batch)
#pass D thru model to get predicted energy vals
E_pred = model(D_train_batch, training=True)
df_dD_train_batch = tape2.gradient(E_pred, D_train_batch)
#matrix mult of -Grad_D(f) x Grad_r(D)
F_pred = -tf.einsum('ijkl,il->ijk', dD_dr_train_batch, df_dD_train_batch)
#calculate loss value
loss = force_energy_loss(E_pred, F_pred, E_train_batch, F_train_batch)
grads = tape1.gradient(loss, model.trainable_weights)
opt.apply_gradients(zip(grads, model.trainable_weights))
Other attempt with gradient tape (persistent = true)
def calc_gradients_persistent(D_train_batch, E_train_batch, F_train_batch, opt):
#set up gradient tape scope in order to track gradients of both d(Loss)/d(Weights)
#and d(output)/d(input)
with tf.GradientTape(persistent = True) as outer:
#set gradient tape to watch Tensor
outer.watch(D_train_batch)
#output values from model, set trainable to be true to get
#model.trainable_weights out
E_pred = model(D_train_batch, training=True)
#set gradient tape to watch trainable weights
outer.watch(model.trainable_weights)
#get gradient of output (f/E_pred) w.r.t input (D/D_train_batch) and cast to double
df_dD_train_batch = outer.gradient(E_pred, D_train_batch)
#matrix mult of -Grad_D(f) x Grad_r(D)
F_pred = -tf.einsum('ijkl,il->ijk', dD_dr_train_batch, df_dD_train_batch)
#calculate loss value
loss = force_energy_loss(E_pred, F_pred, E_train_batch, F_train_batch)
#get gradient of loss w.r.t to trainable weights for back propogation
grads = outer.gradient(loss, model.trainable_weights)
#updates weights using the optimizer and the gradients (grads)
opt.apply_gradients(zip(grads, model.trainable_weights))
These were attempted translations of the pytorch code
# Forward pass: Predict energies from the descriptor input
E_train_pred_batch = model(D_train_batch)
# Get derivatives of model output with respect to input variables. The
# torch.autograd.grad-function can be used for this, as it returns the
# gradients of the input with respect to outputs. It is very important
# to set the create_graph=True in this case. Without it the derivatives
# of the NN parameters with respect to the loss from the force error
# will not be populated (=the force error will not affect the
# training), but the model will still run fine without errors.
df_dD_train_batch = torch.autograd.grad(
outputs=E_train_pred_batch,
inputs=D_train_batch,
grad_outputs=torch.ones_like(E_train_pred_batch),
create_graph=True,
)[0]
# Get derivatives of input variables (=descriptor) with respect to atom
# positions = forces
F_train_pred_batch = -torch.einsum('ijkl,il->ijk', dD_dr_train_batch, df_dD_train_batch)
# Zero gradients, perform a backward pass, and update the weights.
# D_train_batch.grad.data.zero_()
optimizer.zero_grad()
loss = energy_force_loss(E_train_pred_batch, E_train_batch, F_train_pred_batch, F_train_batch)
loss.backward()
optimizer.step()
which is from the tutorial for the Dscribe library at https://singroup.github.io/dscribe/latest/tutorials/machine_learning/forces_and_energies.html
Question
Using either versions of the TF implementation there is a huge loss in prediction accuracy compared to running the pytorch version. I was wondering, have I maybe misunderstood the pytorch code and translated incorrectly and if so where is my discrepancy?
P.S
Model directly computes energies E, from which we use the gradient of E w.r.t D in order to calculate the forces F. The loss function is a weighted sum of MSE of both Force and energies.
These methods are in fact the same, my error was somewhere else which was creating differing results. For anyone whose trying to implement the TensorFlow versions, the nested gradient tapes are about 2x faster, at least in this scenario and also ensure to wrap the functions in an #tf.function in order to use graphs over eager execution, The speed up is about 10x.
I have a encoder model and a decoder model (RNN).
I want to compute the gradients and update the weights.
I'm somewhat confused by what I've seen so far on the web.
Which block is the best practice? Is there any difference between the two options? Gradients seems to converge faster in Block 1, I do not know why?
# BLOCK 1, in two operations
encoder_gradients,decoder_gradients = tape.gradient(loss,[encoder_model.trainable_variables,decoder_model.trainable_variables])
myoptimizer.apply_gradients(zip(encoder_gradients,encoder_model.trainable_variables))
myoptimizer.apply_gradients(zip(decoder_gradients,decoder_model.trainable_variables))
# BLOCK 2, in one operation
gradients = tape.gradient(loss,encoder_model.trainable_variables + decoder_model.trainable_variables)
myoptimizer.apply_gradients(zip(gradients,encoder_model.trainable_variables +
decoder_model.trainable_variables))
You can manually verify this.
First, let's simplify the model. Let the encoder and decoder both be a single dense layer. This is mostly for simplicity and you can print out the weights being applying the gradients, gradients and weights after applying the gradients.
import tensorflow as tf
import numpy as np
from copy import deepcopy
# create a simple model with one encoder and one decoder layer.
class custom_net(tf.keras.Model):
def __init__(self):
super().__init__()
self.encoder = tf.keras.layers.Dense(3, activation='relu')
self.decoder = tf.keras.layers.Dense(3, activation='relu')
def call(self, inp):
return self.decoder(self.encoder(inp))
net = model()
# create dummy input/output
inp = np.random.randn(1,1)
gt = np.random.randn(3,1)
# set persistent to true since we will be accessing the gradient 2 times
with tf.GradientTape(persistent=True) as tape:
out = custom_model(inp)
loss = tf.keras.losses.mean_squared_error(gt, out)
# get the gradients as mentioned in the question
enc_grad, dec_grad = tape.gradient(loss,
[net.encoder.trainable_variables,
net.decoder.trainable_variables])
gradients = tape.gradient(loss,
net.encoder.trainable_variables + net.decoder.trainable_variables)
First, let's use a stateless optimizer like SGD which updates the weights based on the following formula and compare it to the 2 approaches mentioned in the question.
new_weights = weights - learning_rate * gradients.
# Block 1
myoptimizer = tf.keras.optimizers.SGD(learning_rate=1)
# store weights before updating the weights based on the gradients
old_enc_weights = deepcopy(net.encoder.get_weights())
old_dec_weights = deepcopy(net.decoder.get_weights())
myoptimizer.apply_gradients(zip(enc_grad, net.encoder.trainable_variables))
myoptimizer.apply_gradients(zip(dec_grad, net.decoder.trainable_variables))
# manually calculate the weights after gradient update
# since the learning rate is 1, new_weights = weights - grad
cal_enc_weights = []
for weights, grad in zip(old_enc_weights, enc_grad):
cal_enc_weights.append(weights-grad)
cal_dec_weights = []
for weights, grad in zip(old_dec_weights, dec_grad):
cal_dec_weights.append(weights-grad)
for weights, man_calc_weight in zip(net.encoder.get_weights(), cal_enc_weights):
print(np.linalg.norm(weights-man_calc_weight))
for weights, man_calc_weight in zip(net.decoder.get_weights(), cal_dec_weights):
print(np.linalg.norm(weights-man_calc_weight))
# block 2
old_weights = deepcopy(net.encoder.trainable_variables + net.decoder.trainable_variables)
myoptimizer.apply_gradients(zip(gradients, net.encoder.trainable_variables + \
net.decoder.trainable_variables))
cal_weights = []
for weight, grad in zip(old_weights, gradients):
cal_weights.append(weight-grad)
for weight, man_calc_weight in zip(net.encoder.trainable_variables + net.decoder.trainable_variables, cal_weights):
print(np.linalg.norm(weight-man_calc_weight))
You will see that both the methods update the weights in the exact same way.
I think you used an optimizer like Adam/RMSProp which is stateful. For such optimizers invoking apply_gradients will update the optimizer parameters based on the gradient value and sign. In the first case, the optimizer parameters are updated twice and in the second case only once.
I would stick to the second option if I were you, since you are performing just one step of optimization here.
I am coding a vgg16 net with Tensorflow low-level api. The model is test in imagenet12 dataset. Due to computation cost, I split the validation set into 80% training data and 20% test data.
First, the last layer fc8 outputs without the activation of softmax, and I use the tf.nn.softmax_cross_entropy_with_logits_v2(labels, logits) to compute the loss. It finally outputs nan in the training process.
Then I try to add a softmax layer under fc8, but still use tf.nn.softmax_cross_entropy_with_logits_v2(labels, logits) to compute the loss. Surprisingly, the loss outputs normally rather than nan.
Here is the code before adding softmax layer:
def vgg16():
...
fc8_layer = FullConnectedLayer(y, self.weight_dict, regularizer_fc=self.regularizer_fc)
self.op_logits = fc8_layer.layer_output
def loss(self):
entropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.Y, logits=self.op_logits)
l2_loss = tf.losses.get_regularization_loss()
self.op_loss = tf.reduce_mean(entropy, name='loss') + l2_loss
and I change the vgg16 output like:
def vgg16():
...
fc8_layer = FullConnectedLayer(y, self.weight_dict, regularizer_fc=self.regularizer_fc)
self.op_logits = tf.nn.softmax(fc8_layer.layer_output)
Besides, here is my optimizer:
def optimize(self):
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
self.opt = tf.train.MomentumOptimizer(learning_rate=self.config.learning_rate, momentum=0.9,
use_nesterov=True)
self.op_opt = self.opt.minimize(self.op_loss)
I don't understand why adding a softmax layer works. In my idea, two softmax layers wont affect the final loss, since it doesn't change the proportion of each output unit.
I am trying to edit my own model by adding some code to cifar10.py and here is the question.
In cifar10.py, the [tutorial][1] says:
EXERCISE: The output of inference are un-normalized logits. Try editing the network architecture to return normalized predictions using tf.nn.softmax().
So I directly input the output from "local4" to tf.nn.softmax(). This gives me the scaled logits which means the sum of all logits is 1.
But in the loss function, the cifar10.py code uses:
tf.nn.sparse_softmax_cross_entropy_with_logits()
and description of this function says
WARNING: This op expects unscaled logits, since it performs a softmax on logits internally for efficiency. Do not call this op with the output of softmax, as it will produce incorrect results.
Also, according to the description, logits as input to above funtion must have the shape [batch_size, num_classes] and it means logits should be unscaled softmax, like sample code calculate unnormalized softmaxlogit as follow.
# softmax, i.e. softmax(WX + b)
with tf.variable_scope('softmax_linear') as scope:
weights = _variable_with_weight_decay('weights', [192, NUM_CLASSES],
stddev=1/192.0, wd=0.0)
biases = _variable_on_cpu('biases', [NUM_CLASSES],
tf.constant_initializer(0.0))
softmax_linear = tf.add(tf.matmul(local4, weights), biases, name=scope.name)
_activation_summary(softmax_linear)
Does this mean I don't have to use tf.nn.softmax in the code?
You can use tf.nn.softmax in the code if you want, but then you will have to compute the loss yourself:
softmax_logits = tf.nn.softmax(logits)
loss = tf.reduce_mean(- labels * tf.log(softmax_logits) - (1. - labels) * tf.log(1. - softmax_logits))
In practice, you don't use tf.nn.softmax for computing the loss. However you need to use tf.nn.softmax if for instance you want to compute the predictions of your algorithm and compare them to the true labels (to compute accuracy).
In the cifar10 example, the gradients of loss with respect to parameters can be computed as follows:
grads_and_vars = opt.compute_gradients(loss)
for grad, var in grads_and_vars:
# ...
Is there any way to get the gradients of loss with respect to activations (not the parameters), and watch them in Tensorboard?
You can use the tf.gradients() function to compute the gradient of any scalar tensor with respect to any other tensor (assuming the gradients are defined for all of the ops between those two tensors):
activations = ...
loss = f(..., activations) # `loss` is some function of `activations`.
grad_wrt_activations, = tf.gradients(loss, [activation])
Visualizing this in TensorBoard is tricky in general, since grad_wrt_activation is (typically) a tensor with the same shape as activation. Adding a tf.histogram_summary() op is probably the easiest way to visualize
# Adds a histogram of `grad_wrt_activations` to the graph, which will be logged
# with the other summaries, and shown in TensorBoard.
tf.histogram_summary("Activation gradient", grad_wrt_activations)