I want to create a custom layer with weights that update only in the training phase.
From the official documentation, this is the way:
from keras import backend as K
from keras.layers import Layer

class MyLayer(Layer):

    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(MyLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # Create a trainable weight variable for this layer.
        self.kernel = self.add_weight(name='kernel',
                                      shape=(input_shape[1], self.output_dim),
                                      initializer='uniform',
                                      trainable=True)
        super(MyLayer, self).build(input_shape)  # Be sure to call this at the end

    def call(self, x):
        return K.dot(x, self.kernel)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.output_dim)
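For reference, such a layer is used like any built-in one; a minimal usage sketch (hypothetical shapes, not from the documentation):

import numpy as np
from keras.models import Sequential

model = Sequential()
model.add(MyLayer(32, input_shape=(16,)))  # kernel is built as (16, 32)
model.compile(optimizer='sgd', loss='mse')
model.fit(np.random.randn(8, 16), np.random.randn(8, 32), epochs=1)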
In this GitHub repo, the author added
new_centers = self.centers - self.alpha * delta_centers
self.add_update((self.centers, new_centers), x)
where self.centers are the weights.
I can't understand why self.add_update is useful in that situation. Are the weights not updated if I don't call self.add_update? If so, why must new_centers go in the updates list and not in the inputs list? And why is x required?
From the source code:

self.add_update(updates, inputs)

updates: update op or list of update ops to add to the layer.
inputs: input tensor or list of input tensors to mark the updates as conditional on these inputs. If None is passed, the updates are assumed unconditional.
There are two types of weights:

Trainable = updated automatically by the optimizer through backpropagation
Untrainable = not updated by backpropagation

For the trainable weights, it's really not recommended to use updates: you would be mixing the optimizer's updates with your own updates, and that could cause many issues.
For the untrainable weights, you can do whatever you want. Sometimes you want constants, and you will do nothing; sometimes you want these weights to change (but not via backpropagation).
Notice how in that example the weights updated by the user are untrainable:
self.centers = self.add_weight(name='centers',
                               shape=(10, 2),
                               initializer='uniform',
                               # UNTRAINABLE
                               trainable=False)
But the user wants these weights to be updated following some rule. I don't know exactly what they are doing there (I didn't analyse the code), but I assume they are calculating something like the center point of a group of images, and each batch will place this center in a different position. They want to update that position.
A classical example is the BatchNormalization layer. Besides having trainable scale and bias weights used to rescale the outputs, they have the mean and variance weights. These are statistical properties of the data that need to be updated with every batch.
You are not training the "mean" or the "variance", but each batch of data updates these values.
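To make this concrete, here is a minimal sketch (my own, not from the linked repo) of a layer that maintains a running mean as an untrainable weight, assuming the Keras 2 add_update(updates, inputs) API:

from keras import backend as K
from keras.layers import Layer

class RunningMean(Layer):
    """Tracks a running mean of its inputs without backpropagation."""

    def __init__(self, momentum=0.9, **kwargs):
        super(RunningMean, self).__init__(**kwargs)
        self.momentum = momentum

    def build(self, input_shape):
        # Untrainable: the optimizer never touches this weight
        self.mean = self.add_weight(name='mean',
                                    shape=(input_shape[-1],),
                                    initializer='zeros',
                                    trainable=False)
        super(RunningMean, self).build(input_shape)

    def call(self, x):
        batch_mean = K.mean(x, axis=0)
        new_mean = self.momentum * self.mean + (1. - self.momentum) * batch_mean
        # register the update; it runs once per batch inside the train function,
        # conditional on x being fed
        self.add_update(K.update(self.mean, new_mean), x)
        return x

Backpropagation never touches self.mean; the registered update alone moves it, once per batch.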
How does it work?
This is obscure and lies deep down in Keras code.
We need the update operation to make sure self.centers gets new values for every batch; otherwise, it won't.
We use self.add_update in a layer to register that this variable should be updated. (We do similar things in custom optimizers as well; the optimizers contain the updates to the weights made via backpropagation.)
Later, in the source code for training the model, Keras will collect all these registered updates and make a train function. Somewhere inside it, these updates will be applied to the variables:
# inside a training function from keras
with K.name_scope('training'):
    with K.name_scope(self.optimizer.__class__.__name__):
        training_updates = self.optimizer.get_updates(
            params=self._collected_trainable_weights,
            loss=self.total_loss)
    updates = (self.updates +          # probably the updates registered in layers
               training_updates +      # the updates registered in optimizers
               self.metrics_updates)   # don't know....
    # Gets loss and metrics. Updates weights at each call.
    self.train_function = K.function(
        inputs,
        [self.total_loss] + self.metrics_tensors,
        updates=updates,
        name='train_function',
        **self._function_kwargs)
Related
I want to implement a custom optimization algorithm for TF models.
I have read the following sources:
tf documentation on custom optimizers
tf SGD implementation
keras documentation on custom models
towardsdatascience guide on custom optimizers
However, a lot of questions remain.
It seems like it is not possible to evaluate the loss function multiple times (for different weight settings) before applying a gradient step when using the custom optimizer API. For example, in a line-search type of algorithm this is necessary.
I tried to do all steps manually.
Assume I have setup my model and my optimization problem like this
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import models
model = models.Sequential()
model.add(layers.Dense(15, input_dim=10))
model.add(layers.Dense(20))
model.add(layers.Dense(1))
x_train, y_train = get_train_data()
loss = losses.MeanSquaredError()
def val_and_grads(weights):
    model.set_weights(weights)
    with tf.GradientTape() as tape:
        val = loss(y_train, model(x_train))
    grads = tape.gradient(val, model.trainable_variables)
    return val, grads
initial_weights = model.get_weights()
optimal_weights = my_fancy_optimization_algorithm(val_and_grads, initial_weights)
However, my function val_and_grads needs a list of weights and returns a list of gradients; from my_fancy_optimization_algorithm's point of view, that seems unnatural.
I could wrap val_and_grads to "stack" the returned gradients and "split" the passed weights, like this:
def wrapped_val_and_grads(weights):
    val, grads = val_and_grads(split_weights(weights))
    return val, stack_grads(grads)
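For concreteness, split_weights and stack_grads might look like this (hypothetical helpers; only the names come from the snippet above):

import numpy as np

# shapes of the model's weight arrays, captured once
shapes = [w.shape for w in model.get_weights()]

def stack_grads(grads):
    # flatten each gradient tensor and concatenate into one long vector
    return np.concatenate([np.asarray(g).ravel() for g in grads])

def split_weights(flat):
    # slice the flat vector back into arrays matching the model's shapes
    weights, offset = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        weights.append(flat[offset:offset + size].reshape(shape))
        offset += size
    return weights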
However, that seems very inefficient.
In any case, I do not like this approach, since it seems I would lose out on a lot of the surrounding TensorFlow infrastructure (printing of current loss function values and metrics during learning, TensorBoard stuff, ...).
I could also pack the above into a custom model with a tailored train_step, like this:
class CustomModel(keras.Model):
    def train_step(self, data):
        x_train, y_train = data

        def val_and_grads(weights):
            self.set_weights(weights)
            with tf.GradientTape() as tape:
                val = loss(y_train, self(x_train))
            grads = tape.gradient(val, self.trainable_variables)
            return val, grads

        trainable_vars = self.trainable_variables
        old_weights = self.get_weights()
        # this can do multiple evaluations of the model
        update = my_fancy_update_finding_algorithm(val_and_grads, self.get_weights())
        self.set_weights(old_weights)  # restore the weights
        self.optimizer.apply_gradients(zip(update, trainable_vars))
Here I would need an accompanying custom optimizer that does nothing other than add the update to the current weights (new_weights = current_weights + update).
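For what it's worth, such a pass-through optimizer could be sketched like this (a sketch assuming the pre-2.11 tf.keras optimizer API with _resource_apply_dense, not a tested implementation):

import tensorflow as tf

class AddUpdateOptimizer(tf.keras.optimizers.Optimizer):
    """Treats the incoming 'gradients' as precomputed updates and simply
    adds them to the variables (new_weights = current_weights + update)."""

    def __init__(self, name='AddUpdateOptimizer', **kwargs):
        super(AddUpdateOptimizer, self).__init__(name, **kwargs)

    def _resource_apply_dense(self, grad, var, apply_state=None):
        # note the sign: apply_gradients normally descends, here we just add
        return var.assign_add(grad)

    def get_config(self):
        return super(AddUpdateOptimizer, self).get_config()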
I am still unsure if this is the best way to go.
If someone could comment on the snippets and ideas above, point me to other resources I should consider, or provide new approaches and other feedback, I would be very glad.
Thanks all.
Franz
EDIT:
Sadly, I did not get any response here so far. Maybe my question is not concrete enough. As a first, smaller question:
Given the model and val_and_grads from the first listing, how would I efficiently calculate the norm of the WHOLE gradient? What I do so far is
import numpy as np

_, grads = val_and_grads(model.get_weights())
norm_grads = np.linalg.norm(np.concatenate([grad.numpy().flatten() for grad in grads]))
This surely cannot be the "right" way.
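For reference, TensorFlow ships a helper for exactly this; a sketch using it, assuming the model and val_and_grads from the first listing:

import tensorflow as tf

_, grads = val_and_grads(model.get_weights())
# sqrt of the sum of squared norms across the whole list of gradient tensors
norm_grads = tf.linalg.global_norm(grads)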
I want to create a network with two parallel layers (the same input is fed to two different layers, and their outputs are combined with some mathematical operations). Having said that, I am not sure whether back-propagation will be done correctly by Keras automatically. As a simple example of a custom RNN cell:
class Example(keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super(Example, self).__init__(**kwargs)
        self.units = units
        self.state_size = units
        self.la = keras.layers.Dense(self.units)
        self.lb = keras.layers.Dense(self.units)

    def call(self, inputs, states):
        prev_output = states[0]
        # parallel layers
        a = tf.sigmoid(self.la(inputs))
        b = tf.sigmoid(self.lb(inputs))
        # combined using a mathematical operation
        output = (-1 * prev_output * a) + (prev_output * b)
        return output, [output]
Now, the loss gradients reaching the `la` and `lb` layers are different (the gradient of the output w.r.t. `a` is `-prev_output`, but w.r.t. `b` it is `prev_output`); will this be taken care of by Keras automatically, or should we create custom gradient functions?
Any insights and suggestions are much appreciated :)
Check the answer by Daniel Möller.
Back-propagation will be taken care of by Keras as long as all calculations are linked by tensor objects, i.e. don't cast a tensor to another type such as a NumPy array, so don't worry about it.
For example, with a gradient tape you can check the gradient at each layer:
gradients = grad_tape.gradient(total_loss, model.trainable_variables)
gradient_of_last_layer = tf.reduce_max(gradients[-1])
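As a quick hedged sketch (my own, not from the answer), you can wrap the Example cell above in keras.layers.RNN and confirm that gradients reach both parallel Dense sublayers:

import tensorflow as tf
from tensorflow import keras

cell = Example(4)
rnn = keras.layers.RNN(cell)
x = tf.random.normal((2, 3, 5))  # (batch, time, features)

with tf.GradientTape() as tape:
    # nonzero initial state, so the products with prev_output are not all zero
    y = rnn(x, initial_state=tf.ones((2, 4)))
    loss_value = tf.reduce_sum(y)

grads = tape.gradient(loss_value, rnn.trainable_variables)
print([g is not None for g in grads])  # True for both la and lb kernels/biases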
I want to add to my model a layer that, during evaluation, takes the input, applies some transformation (a quantization in this case, but it could be anything) and returns the result as the output. This layer must, however, be completely transparent during training, meaning that it must return the same input tensor.
I have written the following function:
from keras.layers import Lambda
import keras.backend as K
def myquantize(x):
    return K.in_test_phase(K.clip(K.round(x * (2 ** 5)) / (2 ** 5), -3.9, 3.9), x)
which I then use via a Lambda layer:

y = keras.layers.Conv1D(**args1)(x)  # applied to the preceding tensor; args elided in the original
y = keras.layers.AveragePooling1D(pool_size=2)(y)
y = keras.layers.Lambda(myquantize)(y)
y = keras.layers.Conv1D(**args2)(y)
# ...
Now, in principle K.in_test_phase should return x during training and the quantized expression during test.
However, training the network with such a layer prevents the network from learning (i.e. the training loss stops decreasing after 3 epochs), while if I remove it the network keeps training normally. I assume this layer is not actually transparent during training as expected.
in_test_phase has a training parameter which you can set explicitly to indicate whether you are training or not. If you don't set it explicitly, the value of learning_phase is used. This value changes when you reset the graph or when you call different types of fit/predict/evaluate functions of the model.
Since your full code isn't present, you can make use of the training parameter. Set it to True during training, then save the weights of the model using the model's save_weights function. When you wish to test your model, set the training parameter to False, load the weights using load_weights, and proceed accordingly.
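For illustration, a hedged sketch of that suggestion (assuming the Keras 2 backend, where in_test_phase(x, alt, training) selects x in the test phase and alt otherwise):

from keras.layers import Lambda
import keras.backend as K

def myquantize(x, training=None):
    quantized = K.clip(K.round(x * (2 ** 5)) / (2 ** 5), -3.9, 3.9)
    # training=True selects the second branch, i.e. the untouched input
    return K.in_test_phase(quantized, x, training=training)

# while training, force the identity branch:
y = Lambda(lambda t: myquantize(t, training=True))(y)
# for evaluation, rebuild the model with training=False and load the saved weights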
For those who are in a similar situation, I created a custom layer like the following, which I only use during training:
class MyLayer(keras.layers.Layer):
    def __init__(self, **kwargs):
        super(MyLayer, self).__init__(**kwargs)

    def compute_output_shape(self, input_shape):
        return input_shape

    def call(self, inputs, **kwargs):
        x = inputs
        return K.identity(x)
Note that this layer always returns the input tensor; it merely serves as a 'placeholder' for the next step. In the evaluation part of the code, I wrote the following:
class MyLayer(keras.layers.Layer):
    def __init__(self, **kwargs):
        super(MyLayer, self).__init__(**kwargs)

    def compute_output_shape(self, input_shape):
        return input_shape

    def call(self, inputs, **kwargs):
        x = inputs
        return  # Your actual processing here
Here, the only difference is that you actually perform the desired processing steps on your tensor. When I load my stored model, I pass this class as a custom object:
model = keras.models.load_model(model_file,custom_objects={'MyLayer':MyLayer})
Be careful to pass as MyLayer the version where the actual processing is performed.
This is my solution; other suggestions are welcome.
I'm training a language model in Keras and would like to speed up training by using sampled softmax as the final activation function in my network. From the TF docs, it looks like I need to supply arguments for weights and biases, but I'm unsure of what is expected as input for these. It seems like I could write a custom function in Keras as follows:
import keras.backend as K

def sampled_softmax(weights, biases, y_true, y_pred, num_sampled, num_classes):
    return K.sampled_softmax(weights, biases, y_true, y_pred, num_sampled, num_classes)
However, I'm unsure of how to "plug this in" to my existing network. The architecture for the LM is pretty dead-simple:
model = Sequential()
model.add(Embedding(input_dim=len(vocab), output_dim=256))
model.add(LSTM(1024, return_sequences=True))
model.add(Dense(output_dim=len(vocab), activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
Given this architecture, could I pass the sampled_softmax function as the loss argument when calling the compile method on the model? Or does this need to be written as a layer that comes after the final fully-connected layer? Any guidance here would be greatly appreciated. Thanks.
The key observation here is that the TensorFlow sampled softmax function returns actual losses, not a set of predictions over the set of possible labels to compare with the ground truth data to then compute losses as a separate step. This makes the model setup a little bit weird.
First, we add a second input layer to the model that encodes the target (training) data a second time as an input, in addition to being the target output. This is used for the labels argument of the sampled_softmax_loss function. It needs to be a Keras input, because it's treated as an input when we go to instantiate and set up the model.
Second, we construct a new custom Keras layer that calls the sampled_softmax_loss function with two Keras layers as its inputs: the output of the dense layer that predicts our classes, and then the second input that contains a copy of the training data. Note that we're doing some serious hackery accessing the _keras_history instance variable to fetch the weight and bias tensors from the output tensor of the original fully-connected layer.
Finally, we have to construct a new "dumb" loss function that ignores the training data and just uses the loss reported by the sampled_softmax_loss function.
Note that because the sampled softmax function returns losses, not class predictions, you can't use this model specification for validation or inference. You'll need to re-use the trained layers from this "training version" in a new specification that applies a standard softmax function to the original dense layer which has the default activation function applied.
There is definitely a more elegant way to do this, but I believe this works, so I figured I'd post it here now as-is rather than wait until I have something that's a little bit neater. For example, you'd probably want to make the number of classes an argument of the SampledSoftmax layer, or better yet, condense this all into the loss function as in the original question and avoid passing in the training data twice.
from keras.models import Model
from keras.layers import Input, Dense, Layer
from keras import backend as K

class SampledSoftmax(Layer):
    def __init__(self, **kwargs):
        super(SampledSoftmax, self).__init__(**kwargs)

    def call(self, inputs):
        """
        The first input should be the model as it were, and the second the
        target (i.e., a repeat of the training data) to compute the labels
        argument
        """
        # the labels input to this function is batch size by 1, where the
        # value at position (i, 1) is the index that is true (not zero)
        # e.g., (0, 0, 1) => (2) or (0, 1, 0, 0) => (1)
        return K.tf.nn.sampled_softmax_loss(
            weights=inputs[0]._keras_history[0].weights[0],
            biases=inputs[0]._keras_history[0].bias,
            inputs=inputs[0],
            labels=K.tf.reshape(K.tf.argmax(inputs[1], 1), [-1, 1]),
            num_sampled=1000,
            num_classes=200000)

def custom_loss(y_true, y_pred):
    return K.tf.reduce_mean(y_pred)

num_classes = 200000
input = Input(shape=(300,))
target_input = Input(shape=(num_classes,))
dense = Dense(num_classes)
outputs = dense(input)
outputs = SampledSoftmax()([outputs, target_input])
model = Model([input, target_input], outputs)
model.compile(optimizer=u'adam', loss=custom_loss)
# train as desired
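For completeness, the "inference version" mentioned above might be built by reusing the trained dense layer with a full softmax (a sketch under the assumption that it runs after training the model above):

from keras.layers import Activation

# reuse the trained Dense layer; apply a real softmax instead of the sampled loss
inference_outputs = Activation('softmax')(dense(input))
inference_model = Model(input, inference_outputs)
# inference_model.predict(...) now yields class probabilities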
I am working on language modelling and the vocabulary is large, so I want to use sampled_softmax_loss from TensorFlow. The problem is that the weights and biases, which are arguments of the sampled_softmax_loss function, seem not to be trainable (their values don't change after training).
So I guess that I should add them to the computation graph built automatically by the Keras Model, but I spent a lot of time and still haven't found a proper way to do so.
So, once again: I want to add external trainable tf.Variables to the Keras computation graph. Does anyone know a method to do so?
My model (head and tail):
input_sentence = Input(shape=(INPUT_LENGTH,), dtype='int32')
words = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                  weights=[embedding_matrix], trainable=True)(input_sentence)
...
context = Dense(256, activation='tanh')(context)
model = Model(inputs=input_sentence, outputs=context, name=name)
The loss:
def softmax_fine_loss(labels, logits, transposed_W=None, b=None):
    # tuple-unpacking lambdas are Python 2 only; unpack by index for Python 3
    res = tf.map_fn(lambda args: tf.nn.sampled_softmax_loss(transposed_W, b, args[0], args[1],
                                                            num_sampled=1000,
                                                            num_classes=OUTPUT_COUNT + 1),
                    (labels, logits), dtype=tf.float32)
    return res
loss = lambda labels, logits: softmax_fine_loss(labels, logits, transposed_W=transposed_W, b=b)
model_truncated.compile(optimizer=optimizer, loss=loss, sample_weight_mode='temporal')
I have finally found a workaround.
Let's say we need to train weights W and biases b with our model.
The workaround is to just add them to one of the trainable layers of our model:

model.layers[-1].trainable_weights.extend([W, b])

Then we can compile the model:

model.compile(...)

It is extremely important to add the variables to a trainable layer; for example, I've experimented with a Sequential model, and adding [W, b] to an Activation layer does not make them actually trainable.
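For completeness, a hedged sketch of how the external variables themselves might be created in old-style Keras (hypothetical shapes: OUTPUT_COUNT + 1 classes, 256-dimensional context vectors, matching the model above):

import numpy as np
import keras.backend as K

transposed_W = K.variable(np.random.randn(OUTPUT_COUNT + 1, 256) * 0.01,
                          name='transposed_W')
b = K.variable(np.zeros(OUTPUT_COUNT + 1), name='b')

# attach them to a trainable layer BEFORE compiling, so Keras collects them
model_truncated.layers[-1].trainable_weights.extend([transposed_W, b])
model_truncated.compile(optimizer=optimizer, loss=loss, sample_weight_mode='temporal')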