How to get gradients during fit or fit_generator in Keras - tensorflow

I need to monitor the gradients in real time during training when using the fit or fit_generator methods. This should be achievable with a custom callback, but I don't know how to access the gradients correctly. The attribute model.optimizer.update returns gradient tensors, but they need to be fed with data. What I want is the value of the gradients that were actually applied in the last batch during training.
The following answer does not give a corresponding solution, because it just defines a function that calculates the gradients by feeding extra data:
Getting gradient of model output w.r.t weights using Keras

Related

Keras: Custom loss function with training data not directly related to model

I am trying to convert my CNN written with tensorflow layers to use the keras api in tensorflow (I am using the keras api provided by TF 1.x), and am having trouble writing a custom loss function to train the model.
According to this guide, when defining a loss function it expects the arguments (y_true, y_pred)
https://www.tensorflow.org/guide/keras/train_and_evaluate#custom_losses
def basic_loss_function(y_true, y_pred):
    return ...
However, in every example I have seen, y_true is somehow directly related to the model (in the simple case it is the output of the network). In my problem, this is not the case. How do I implement this if my loss function depends on some training data that is unrelated to the tensors of the model?
To be concrete, here is my problem:
I am trying to learn an image embedding trained on pairs of images. My training data includes image pairs and annotations of matching points between the image pairs (image coordinates). The input feature is only the image pairs, and the network is trained in a siamese configuration.
I am able to implement this successfully with tensorflow layers and train it successfully with tensorflow estimators.
My current implementation builds a tf.data Dataset from a large database of TFRecords, where the features are a dictionary containing the images and arrays of matching points. Previously I could easily feed these arrays of image coordinates to the loss function, but here it is unclear how to do so.
There is a hack I often use, which is to calculate the loss within the model by means of Lambda layers. (This is useful when the loss is independent of the true data, for instance, and the model doesn't really have an output to be compared.)
In a functional API model:
def loss_calc(x):
    loss_input_1, loss_input_2 = x  # arbitrary inputs; you choose them
                                    # according to what you gave to the Lambda layer

    # here you use some external data that doesn't relate to the samples
    externalData = K.constant(external_numpy_data)

    # calculate and return the loss
    return the_loss
Use the outputs of the model itself (the tensor(s) that are used in your loss):
loss = Lambda(loss_calc)([model_output_1, model_output_2])
Create the model outputting the loss instead of the outputs:
model = Model(inputs, loss)
Create a dummy keras loss function for compilation:
def dummy_loss(y_true, y_pred):
    return y_pred  # y_pred is the loss itself, the output of the model above

model.compile(loss=dummy_loss, ...)
Use any dummy array with the correct number of samples for training; it will be ignored:
model.fit(your_inputs, np.zeros((number_of_samples,)), ...)
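Put together, a minimal end-to-end sketch of this pattern might look like the following. The base network, shapes, and external_numpy_data here are illustrative placeholders, not code from the question:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, backend as K

# Hypothetical siamese setup; adjust the shapes and the base network to your data.
inp_a = keras.Input(shape=(64, 64, 3))
inp_b = keras.Input(shape=(64, 64, 3))
base = keras.Sequential([layers.Conv2D(8, 3, activation='relu'),
                         layers.GlobalAveragePooling2D(),
                         layers.Dense(16)])            # shared embedding network
emb_a, emb_b = base(inp_a), base(inp_b)

external_numpy_data = np.ones((16,), dtype='float32')  # stand-in for your extra data

def loss_calc(x):
    emb_1, emb_2 = x                                    # the tensors passed to the Lambda
    externalData = K.constant(external_numpy_data)      # data unrelated to the model's tensors
    return K.mean(K.square(emb_1 - emb_2) * externalData, axis=-1, keepdims=True)

loss = layers.Lambda(loss_calc)([emb_a, emb_b])         # the model's output is the loss itself

def dummy_loss(y_true, y_pred):
    return y_pred                                       # just pass the computed loss through

model = keras.Model([inp_a, inp_b], loss)
model.compile(optimizer='adam', loss=dummy_loss)

# dummy targets, one value per sample; they are ignored by dummy_loss
# model.fit([images_a, images_b], np.zeros((num_samples, 1)), epochs=1)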
Another way of doing it is to use a custom training loop. This is much more work, though.
Although you're using TF1, you can still turn eager execution on at the very beginning of your code and do things the way they're done in TF2 (tf.enable_eager_execution()).
Follow the tutorial for custom training loops: https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough
Here you calculate the gradients yourself, of any result with respect to whatever you want. This means you don't need to follow the Keras standards of training.
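For illustration, a rough sketch of such a loop under TF1 eager execution could look like the following; the dataset structure, the embedding model, and matching_point_loss are assumptions standing in for the question's real pipeline, not a tested implementation:

import tensorflow as tf

tf.enable_eager_execution()                  # TF1 only; TF2 is eager by default

optimizer = tf.train.AdamOptimizer(1e-3)     # tf.keras.optimizers.Adam in TF2

def matching_point_loss(emb_a, emb_b, points):
    # Placeholder: the real loss would compare the embeddings at the annotated
    # matching coordinates; here we only penalize the overall embedding distance.
    del points
    return tf.reduce_mean(tf.square(emb_a - emb_b))

# 'dataset' is assumed to yield (images_a, images_b, points) batches,
# and 'model' is the shared embedding network.
for images_a, images_b, points in dataset:
    with tf.GradientTape() as tape:
        emb_a = model(images_a, training=True)
        emb_b = model(images_b, training=True)
        loss = matching_point_loss(emb_a, emb_b, points)
    grads = tape.gradient(loss, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))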
Finally, you can use the approach you suggested of model.add_loss.
In this case, you calculate the loss exactly the same way as in the first approach above, and pass this loss tensor to add_loss.
You can probably compile a model with loss=None then (not sure), because you're going to use other losses, not the standard one.
In this case, your model's output will probably be None too, and you should fit with y=None.
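Continuing the Lambda sketch above, a hedged sketch of the add_loss route might look like this; whether loss=None and y=None behave exactly this way can vary between TF versions, as noted:

# reuse loss_calc, emb_a, emb_b, inp_a, inp_b from the sketch above
loss_tensor = layers.Lambda(loss_calc)([emb_a, emb_b])

model = keras.Model([inp_a, inp_b], loss_tensor)
model.add_loss(K.mean(loss_tensor))          # register the mean over the batch as the training loss
model.compile(optimizer='adam', loss=None)   # no standard loss attached to the output

# model.fit([images_a, images_b], y=None, epochs=1)   # no targets needed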

An Efficient way to Calculate loss function batchwise?

I am using autoencoders to do anomaly detection. I have finished training my model, and now I want to calculate the reconstruction loss for each entry in the dataset, so that I can flag data points with a high reconstruction loss as anomalies.
This is my current code to calculate the reconstruction loss.
But this is really slow. By my estimate, it would take 5 hours to go through the dataset, whereas training one epoch takes approximately 55 minutes.
I feel that converting to tensor operations is bottlenecking the code, but I can't find a better way to do it.
I've tried changing the batch size, but it does not make much of a difference. I have to use the convert_to_tensor part because K.eval throws an error if I do it normally.
for i in range(0, encoded_dataset.shape[0], batch_size):
    y_true = tf.convert_to_tensor(encoded_dataset[i:i+batch_size].values, np.float32)
    y_pred = tf.convert_to_tensor(ae1.predict(encoded_dataset[i:i+batch_size].values), np.float32)
    # Append the batch losses (numpy array) to the list
    reconstruction_loss_transaction.append(K.eval(loss_function(y_true, y_pred)))
I was able to train in 55 minutes per epoch, so I feel that prediction should not take 5 hours per epoch. encoded_dataset is a variable that holds the entire dataset in main memory as a data frame. I am using an Azure VM instance.
K.eval(loss_function(y_true, y_pred)) is there to find the loss for each row of the batch. So y_true will be of size (batch_size, 2000), and so will y_pred. K.eval(loss_function(y_true, y_pred)) gives me an output of size (batch_size, 1), evaluating the binary cross entropy on each row of y_true and y_pred.
Moved from comments:
My suspicion is that ae1.predict and K.eval(loss_function) are behaving in unexpected ways. ae1.predict should normally be used to output the loss function value as well as y_pred. When you create the model, specify that the loss value is another output (you can have a list of multiple outputs), then just call predict here once to get both y_pred and the loss value in one call.
But I want the loss for each row. Won't the loss returned by the predict method be the mean loss for the entire batch?
The answer depends on how the loss function is implemented. Both ways produce perfectly valid and identical results in TF under the hood: you can average the loss over the batch before taking the gradient w.r.t. the loss, or take the gradient w.r.t. a vector of losses. The gradient operation in TF will perform the averaging of the losses for you if you use the latter approach (see the SO articles on taking per-sample gradients; that is actually hard to do).
If Keras implements the loss with reduce_mean built into the loss, you can just define your own loss. If you're using square loss, replace 'mean_squared_error' with lambda y_true, y_pred: tf.square(y_pred - y_true). That produces the squared error instead of the MSE (no difference to the gradient), but look here for the variant that includes the mean.
In any case this produces a per-sample loss as long as you don't use tf.reduce_mean, which is purely optional in the loss. Another option is to simply compute the loss separately from what you optimize for and make it an output of the model; that is also perfectly valid.
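As an illustration of avoiding the per-batch K.eval round trips altogether, one possible approach (assuming ae1, encoded_dataset, and a per-row binary cross entropy as described in the question) is to predict once over the whole dataset and compute the per-row loss in NumPy:

import numpy as np

x = encoded_dataset.values.astype('float32')   # shape (num_rows, 2000)
y_pred = ae1.predict(x, batch_size=256)        # a single predict call over the full dataset

eps = 1e-7
y_pred = np.clip(y_pred, eps, 1.0 - eps)
# per-row binary cross entropy, averaged over the 2000 features
reconstruction_loss = -np.mean(x * np.log(y_pred) + (1.0 - x) * np.log(1.0 - y_pred), axis=1)
# reconstruction_loss has shape (num_rows,); high values flag likely anomalies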

Implementing stochastic forward passes in part of a neural network in Keras?

My problem is the following:
I am working on an object detection problem and would like to use dropout during test time to obtain a distribution of outputs. The object detection network consists of a training model and a prediction model, which wraps around the training model. I would like to perform several stochastic forward passes using the training model and combine these, e.g. by averaging the predictions in the prediction wrapper. Is there a way of doing this in a keras model instead of requiring an intermediate processing step using numpy?
Note that this question is not about how to enable dropout during test time.
def prediction_wrapper(model):
    # Example code.
    # Arguments
    #     model: the training model
    regression = model.outputs[0]
    classification = model.outputs[1]

    predictions = ...      # TODO: perform several stochastic forward passes (dropout during train and test time) here
    avg_predictions = ...  # TODO: combine predictions here, e.g. by computing the mean
    outputs = ...          # TODO: do some processing on avg_predictions

    return keras.models.Model(inputs=model.inputs, outputs=outputs, name=name)
I use keras with a tensorflow backend.
I appreciate any help!
The way I understand it, you're trying to average the weight updates for a single sample while dropout is enabled. Since dropout is random, you would get different weight updates for the same sample.
If this understanding is correct, then you could create a batch by duplicating the same sample. Here I am assuming that the dropout mask is different for each sample in a batch. Since backpropagation averages the weight updates over the batch anyway, you would get your desired behavior.
If that does not work, you could write a custom loss function and train with a batch size of one. You could update a global counter inside your custom loss function and return a non-zero loss only once you've averaged them the way you want. I don't know if this would work; it's just an idea.
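For the duplicate-the-sample idea, here is a minimal sketch of what it could look like at prediction time, with the learning phase forced to 1 so that dropout stays active. Here model is assumed to be the training model; T, mc_predict, and the two-output split (regression, classification) follow the question's example and are otherwise hypothetical:

import numpy as np
from tensorflow.keras import backend as K

T = 20  # number of stochastic forward passes (arbitrary choice)

# Run the training model with the learning phase set to 1 so Dropout is applied.
stochastic_forward = K.function(model.inputs + [K.learning_phase()], model.outputs)

def mc_predict(x):
    # x has shape (1, ...); repeat it T times along the batch axis so that each
    # copy gets a different dropout mask, then average the resulting predictions.
    x_tiled = np.repeat(x, T, axis=0)
    regression, classification = stochastic_forward([x_tiled, 1])
    return regression.mean(axis=0), classification.mean(axis=0)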

How to cancel BP in some layers in tensorflow?

When I try to fine-tune a VGG network, I only want to update the weights after the 5th convolution layer. In Caffe we can cancel backpropagation in the configuration file. What should I do in TensorFlow? Thanks!
Just use tf.stop_gradient() on the input of your 5th layer. TensorFlow will not backpropagate the error below it. tf.stop_gradient() is an operation that acts as the identity function in the forward direction but stops the gradient in the backward direction.
From the documentation:
tf.stop_gradient
Stops gradient computation.
When executed in a graph, this op outputs its input tensor as-is. When building ops to compute gradients, this op prevents the contribution of its inputs to be taken into account. Normally, the gradient generator adds ops to a graph to compute the derivatives of a specified 'loss' by recursively finding out inputs that contributed to its computation. If you insert this op in the graph, its inputs are masked from the gradient generator. They are not taken into account for computing gradients.
Otherwise you can use optimizer.minimize(loss, var_list=variables_of_fifth_layer). Here you are running backpropagation and updating only the variables of your 5th layer.
For a fast selection of the variables of interest you could (a sketch of both options follows this list):
Define trainable=False for all the variables that you don't want to update, and use variables_of_fifth_layer = tf.trainable_variables().
Divide the layers into specific scopes and then use variables_of_fifth_layer = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "scope/of/fifth/layer").
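A small sketch of both options, where the layer names, shapes, and the 'conv5' scope are illustrative placeholders rather than the actual VGG graph:

import tensorflow as tf

# Stand-ins for the real network; only the wiring matters here.
conv4_out = tf.placeholder(tf.float32, [None, 28, 28, 256])
labels = tf.placeholder(tf.float32, [None, 10])

# Option 1: block gradients from flowing below the 5th conv layer.
conv5_in = tf.stop_gradient(conv4_out)
conv5_out = tf.layers.conv2d(conv5_in, filters=512, kernel_size=3,
                             padding='same', name='conv5')
logits = tf.layers.dense(tf.layers.flatten(conv5_out), 10, name='head')
loss = tf.losses.softmax_cross_entropy(labels, logits)

# Option 2 (alternative): compute gradients only for a chosen subset of variables.
# In practice you would also include the layers after conv5 (here, 'head').
fifth_layer_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='conv5')
train_op = tf.train.GradientDescentOptimizer(1e-3).minimize(loss, var_list=fifth_layer_vars)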

Implementing gradient descent in TensorFlow instead of using the one provided with it

I want to use gradient descent with momentum (keeping track of previous gradients) while building a classifier in TensorFlow.
So I don't want to use tensorflow.train.GradientDescentOptimizer; instead, I want to use tensorflow.gradients to calculate the gradients, keep track of previous gradients, and update the weights based on all of them.
How do I do this in TensorFlow?
TensorFlow has an implementation of gradient descent with momentum (tf.train.MomentumOptimizer).
To answer your general question about implementing your own optimization algorithm, TensorFlow gives you the primitives to calculate gradients and to update variables using the calculated gradients. In your model, suppose loss designates the loss function and var_list is a python list of the TensorFlow variables in your model (which you can get by calling tf.all_variables or tf.trainable_variables). Then you can calculate the gradients w.r.t. your variables as follows:
grads = tf.gradients(loss, var_list)
For simple gradient descent, you would simply subtract the product of the gradient and the learning rate from the variable. The code for that would look as follows:
var_updates = []
for grad, var in zip(grads, var_list):
    var_updates.append(var.assign_sub(learning_rate * grad))
train_op = tf.group(*var_updates)
You can train your model by calling sess.run(train_op). Now you can do all sorts of things before actually updating your variables. For instance, you can keep track of the gradients in a different set of variables and use them for the momentum algorithm. Or you can clip your gradients before updating the variables. All of these are simple TensorFlow operations, because the gradient tensors are no different from the other tensors that you compute in TensorFlow. Please look at the implementations (Momentum, RMSProp, Adam) of some of the fancier optimization algorithms to understand how you can implement your own.
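For the momentum case specifically, a minimal sketch along those lines might look like this; loss and var_list are assumed to exist as described above, and the velocity variables and hyperparameters are an illustrative choice:

import tensorflow as tf

learning_rate = 0.01
momentum = 0.9

var_list = tf.trainable_variables()
grads = tf.gradients(loss, var_list)

# One non-trainable "velocity" variable per model variable keeps track of past gradients.
velocities = [tf.Variable(tf.zeros_like(v.initialized_value()), trainable=False)
              for v in var_list]

var_updates = []
for grad, var, vel in zip(grads, var_list, velocities):
    new_vel = vel.assign(momentum * vel + grad)              # accumulate gradient history
    var_updates.append(var.assign_sub(learning_rate * new_vel))
train_op = tf.group(*var_updates)

# Each sess.run(train_op, feed_dict=...) applies one momentum update step.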