keras add external trainable variable to graph - tensorflow

I am working on language modelling and the vocabulary is large, so I want to use sampled_softmax_loss from tensorflow. The problem is that the weights and biases passed as arguments to sampled_softmax_loss do not seem to be trainable (their values don't change after training).
So I guess that I should add them to the computation graph that the keras Model builds automatically, but I have spent a lot of time and still haven't found a proper way to do so.
So, once again: I want to add external trainable tf.Variables to the keras computation graph. Does anyone know how to do this?
my model (head and tail)
input_sentence = Input(shape=(INPUT_LENGTH,), dtype='int32')
words = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                  weights=[embedding_matrix], trainable=True)(input_sentence)
...
context = Dense(256, activation='tanh')(context)
model = Model(inputs=input_sentence, outputs=context, name=name)
loss
def softmax_fine_loss(labels, logits, transposed_W=None, b=None):
    # Apply the sampled softmax loss to each (labels, logits) pair.
    # Tuple-unpacking lambdas are Python 2 only, so the pair is indexed instead.
    res = tf.map_fn(lambda pair: tf.nn.sampled_softmax_loss(transposed_W, b, pair[0], pair[1],
                                                            num_sampled=1000, num_classes=OUTPUT_COUNT+1),
                    (labels, logits), dtype=tf.float32)
    return res
loss = lambda labels, logits: softmax_fine_loss(labels, logits, transposed_W=transposed_W, b=b)
model_truncated.compile(optimizer=optimizer, loss=loss, sample_weight_mode='temporal')

I have finally found a workaround.
Let's say we need to train weights W and biases b along with our model.
The workaround is simply to add them to one of the trainable layers of the model:
model.layers[-1].trainable_weights.extend([W, b])
Then we can compile the model:
model.compile(...)
It is extremely important to add the variables to a trainable layer. For example, I experimented with a Sequential model, and adding [W, b] to the Activation layer did not make them trainable.
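The workaround above relies on trainable_weights being a plain, mutable list, which may not hold in newer Keras versions. A different route, sketched here under the assumption of TensorFlow 2.x with eager execution, is a custom training step that differentiates the external variables together with the model's own weights. The sizes, the tiny stand-in model and the hand-written SGD update are all hypothetical and only meant to keep the example short.

```python
import tensorflow as tf

NUM_CLASSES, DIM, NUM_SAMPLED = 50, 8, 5  # hypothetical sizes

# Stand-in for the Keras model that produces the context vector.
model = tf.keras.Sequential([tf.keras.layers.Dense(DIM, activation="tanh")])

# External variables that are used only inside the loss.
W = tf.Variable(tf.random.normal([NUM_CLASSES, DIM], stddev=0.1), name="W")
b = tf.Variable(tf.zeros([NUM_CLASSES]), name="b")

def train_step(x, labels, learning_rate=0.1):
    with tf.GradientTape() as tape:
        context = model(x, training=True)
        loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
            weights=W, biases=b, labels=labels, inputs=context,
            num_sampled=NUM_SAMPLED, num_classes=NUM_CLASSES))
    # Differentiate with respect to the model weights AND the external variables.
    variables = list(model.trainable_variables) + [W, b]
    grads = tape.gradient(loss, variables)
    for var, grad in zip(variables, grads):
        # Gradients of looked-up rows arrive as IndexedSlices; densify them first.
        var.assign_sub(learning_rate * tf.convert_to_tensor(grad))
    return loss

x = tf.random.normal([4, 16])
labels = tf.constant([[1], [2], [3], [4]], dtype=tf.int64)
W_before = W.numpy().copy()
loss_value = train_step(x, labels)
```

In practice a Keras optimizer would normally replace the manual assign_sub loop; the point of the sketch is only that W and b receive gradients once they are listed next to the model's variables.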

Related

Custom loss function with regularization cost added in TensorFlow

I wrote a custom loss function that adds the regularization loss to the total loss. I added an L2 regularizer to the kernels only, but when I called model.fit() a warning appeared stating that gradients do not exist for the biases, and the biases are not updated. Also, if I remove the regularizer from the kernel of one of the layers, the gradient for that kernel does not exist either.
I tried adding a bias regularizer to each layer and everything worked correctly, but I don't want to regularize the biases, so what should I do?
Here is my loss function:
def _loss_function(y_true, y_pred):
    # convert tensors to numpy arrays
    y_true_n = y_true.numpy()
    y_pred_n = y_pred.numpy()
    # modify probabilities for Knowledge Distillation loss
    # we do this for old tasks only
    old_y_true = np.float_power(y_true_n[:, :-1], 0.5)
    old_y_true = old_y_true / np.sum(old_y_true)
    old_y_pred = np.float_power(y_pred_n[:, :-1], 0.5)
    old_y_pred = old_y_pred / np.sum(old_y_pred)
    # Define the loss that we will use for new and old tasks
    bce = tf.keras.losses.BinaryCrossentropy()
    # compute the loss on old tasks
    old_loss = bce(old_y_true, old_y_pred)
    # compute the loss on new task
    new_loss = bce(y_true_n[:, -1], y_pred_n[:, -1])
    # compute the regularization loss
    reg_loss = tf.compat.v1.losses.get_regularization_loss()
    assert reg_loss is not None
    # convert all tensors to float64
    old_loss = tf.cast(old_loss, dtype=tf.float64)
    new_loss = tf.cast(new_loss, dtype=tf.float64)
    reg_loss = tf.cast(reg_loss, dtype=tf.float64)
    return old_loss + new_loss + reg_loss
In keras, the loss function should return the loss value without regularization losses. The regularization losses are added automatically when you set kernel_regularizer or bias_regularizer on a keras layer.
In other words, when you write a custom loss function, you don't have to handle regularization losses yourself.
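As a small illustration of where those automatic regularization terms live, the sketch below (assuming tf.keras in TensorFlow 2.x; the layer sizes and the all-ones initializer are hypothetical choices that make the value easy to check) reads them from model.losses.

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(
    2,
    kernel_initializer="ones",
    kernel_regularizer=tf.keras.regularizers.L2(0.01),  # kernel only, no bias regularizer
)
model = tf.keras.Sequential([tf.keras.Input(shape=(3,)), layer])

# One entry per regularized weight; fit() adds these on top of the compiled loss.
reg_losses = model.losses
# Here the kernel is a 3x2 matrix of ones, so the term is 0.01 * 6.
total_reg = float(tf.add_n(reg_losses))
```

Because the bias has no regularizer, it contributes nothing to model.losses, yet it still receives gradients from the data term of the loss.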
Edit: the reason you got the warning that gradients don't exist is the use of numpy() in your loss function. numpy() stops any gradient propagation.
The fact that the warnings disappeared after you added regularizers to the layers does not imply that the gradients were then computed correctly. They would only include the gradients from the regularizers, not from the data. numpy() should be removed from the loss function in order to get correct gradients.
One solution is to keep everything as tensors and use the tf.math library, e.g. use tf.pow in place of np.float_power and tf.reduce_sum in place of np.sum.
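A tensor-only version of the loss could look roughly like the sketch below. It assumes TensorFlow 2.x, mirrors the structure of the question's function (including the normalization over the whole batch), and leaves the regularization term out because Keras adds it separately. The name distillation_loss is hypothetical.

```python
import tensorflow as tf

def distillation_loss(y_true, y_pred):
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.cast(y_pred, tf.float32)
    # Soften the probabilities of the old tasks (all columns but the last).
    old_y_true = tf.pow(y_true[:, :-1], 0.5)
    old_y_true = old_y_true / tf.reduce_sum(old_y_true)
    old_y_pred = tf.pow(y_pred[:, :-1], 0.5)
    old_y_pred = old_y_pred / tf.reduce_sum(old_y_pred)
    bce = tf.keras.losses.BinaryCrossentropy()
    old_loss = bce(old_y_true, old_y_pred)
    # New task: last column, kept 2-D with the -1: slice.
    new_loss = bce(y_true[:, -1:], y_pred[:, -1:])
    return old_loss + new_loss

# Quick check that gradients now flow back to the predictions.
y_true = tf.constant([[0.2, 0.8, 1.0], [0.6, 0.4, 0.0]])
y_pred = tf.constant([[0.3, 0.7, 0.9], [0.5, 0.5, 0.2]])
with tf.GradientTape() as tape:
    tape.watch(y_pred)
    loss = distillation_loss(y_true, y_pred)
grad = tape.gradient(loss, y_pred)
```

Since no value ever leaves the TensorFlow graph, the gradient with respect to y_pred is defined, which is what the numpy() calls prevented.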

Build layers with fixed weights in TensorFlow

I want to build a fully-connected (dense) layer for a regression task. I usually do it with TF2, using Keras API like:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(units=2, activation='sigmoid', input_shape=(1, )))
model.add(tf.keras.layers.Dense(units=2, activation='linear'))
model.compile(optimizer='adam', loss='mae')
model.fit(inp_data, out_data, epochs=1000)
Now I want to build a custom layer. The layer is composed of, say, 10 units, of which 8 units have predefined, fixed, untrainable weights and biases and 2 units have randomly chosen weights and biases, to be trained by the network. Does anyone have an idea how I can define it in TensorFlow?
Keras layers accept a trainable parameter, True by default, that indicates whether you want them to be trained. Non-trainable layers just keep the values given by their initializer. If I understand correctly, you want a single layer that is only partially trainable. That is not possible as such with the existing layers. Maybe you could do it with a custom layer class, but you can get equivalent behavior by using two simple layers and concatenating them (as long as your activation works element-wise; and even if it doesn't, as in a softmax layer, you could apply that activation after the concatenation). This is how it could work:
inputs = tf.keras.Input(shape=(1,))
# This is the trainable part of the layer (2 units)
layer_train = tf.keras.layers.Dense(units=2, activation='sigmoid')(inputs)
# This is the non-trainable part (8 units)
layer_const = tf.keras.layers.Dense(units=8, activation='sigmoid', trainable=False)(inputs)
# Merge both parts
layer = tf.keras.layers.Concatenate()([layer_train, layer_const])
# Make model
model = tf.keras.Model(inputs=inputs, outputs=layer)
# ...
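To give the frozen part predefined values rather than whatever the initializer produced, one option is set_weights on the non-trainable layer. The sketch below assumes tf.keras in TensorFlow 2.x; fixed_kernel and fixed_bias are hypothetical placeholder values standing in for the predefined weights.

```python
import numpy as np
import tensorflow as tf

# Hypothetical predefined weights for the 8 fixed units (input dimension 1).
fixed_kernel = np.full((1, 8), 0.5, dtype="float32")
fixed_bias = np.zeros((8,), dtype="float32")

inputs = tf.keras.Input(shape=(1,))
const_layer = tf.keras.layers.Dense(8, activation="sigmoid",
                                    trainable=False, name="fixed_units")
train_layer = tf.keras.layers.Dense(2, activation="sigmoid",
                                    name="trainable_units")
merged = tf.keras.layers.Concatenate()([const_layer(inputs), train_layer(inputs)])
model = tf.keras.Model(inputs=inputs, outputs=merged)

# The layer is built at this point, so its kernel and bias can be overwritten.
const_layer.set_weights([fixed_kernel, fixed_bias])
```

Only the kernel and bias of trainable_units show up in model.trainable_weights, so the optimizer leaves the eight fixed units untouched.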

What is the correct way to implement a 'useless loss' with Keras?

I have a Keras model that has two outputs:
output is the true output of the network on which the loss is going to be computed
additional is used to perform an external task during inference (no loss should be computed on this output)
When I build the model, I write something like this:
model = Model(inputs=inp, outputs=[output, additional])
Since my Model has two outputs, I need to provide two losses when compiling the model so I created a useless loss like this:
class NoopLoss(object):
    def __call__(self, y_true, y_pred, **kwargs):
        return self.compute_loss(y_true, y_pred)
    def compute_loss(self, y_true, y_pred):
        return tf.math.square(0.0)
Which I integrate in the compile step like this:
loss = UsefulLoss() # the real loss I'm using
noop_loss = NoopLoss()
model.compile(loss=[loss, noop_loss], optimizer=optimizer, metrics=['binary_accuracy'])
It works, but it feels a bit hackish. Is there a correct way to implement this behavior? I didn't find any official useless loss in the Keras documentation.
In my opinion, Keras was not designed with cases like this in mind.
I often use these hacks myself too.
But you can create a training model and an inference model that share the trainable part (I'm not sure it's a better solution; it might not be):
inputs = Input(...)
trainable_out = SomeLayer(...)(inputs)
....
trainable_out = ....
extra_output = SomeLayer(...)(something)
training_model = Model(inputs, trainable_out)
inference_model = Model(inputs, [trainable_out, extra_output])
You can train training_model, and the other model will be trained automatically as well, since both models share the same layers.
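A runnable version of that idea might look like the following sketch, assuming tf.keras in TensorFlow 2.x. The layer sizes, the random data and the layer names are hypothetical; only the two-model structure comes from the answer.

```python
import numpy as np
import tensorflow as tf

inputs = tf.keras.Input(shape=(4,))
hidden = tf.keras.layers.Dense(8, activation="relu")(inputs)
trainable_out = tf.keras.layers.Dense(1, activation="sigmoid", name="output")(hidden)
# Extra head used only at inference time; it never sees a loss.
extra_output = tf.keras.layers.Dense(3, name="additional")(hidden)

training_model = tf.keras.Model(inputs, trainable_out)
inference_model = tf.keras.Model(inputs, [trainable_out, extra_output])

# Only the training model is compiled, so no dummy loss is needed.
training_model.compile(optimizer="adam", loss="binary_crossentropy")
x = np.random.rand(16, 4).astype("float32")
y = np.random.randint(0, 2, size=(16, 1)).astype("float32")
training_model.fit(x, y, epochs=1, verbose=0)

# The inference model reuses the same layer objects, hence the trained weights.
main_pred, extra_pred = inference_model.predict(x, verbose=0)
```

Note that in this sketch the weights of the additional head itself stay at their initial values, because nothing in training_model depends on them.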

Custom layer updates

I want to create a custom layer with weights that update only in the training phase.
According to the official documentation, this is the way:
from keras import backend as K
from keras.layers import Layer
class MyLayer(Layer):
    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(MyLayer, self).__init__(**kwargs)
    def build(self, input_shape):
        # Create a trainable weight variable for this layer.
        self.kernel = self.add_weight(name='kernel',
                                      shape=(input_shape[1], self.output_dim),
                                      initializer='uniform',
                                      trainable=True)
        super(MyLayer, self).build(input_shape) # Be sure to call this at the end
    def call(self, x):
        return K.dot(x, self.kernel)
    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.output_dim)
in this github repo
the author added
new_centers = self.centers - self.alpha * delta_centers
self.add_update((self.centers, new_centers), x)
where self.centers are the weights.
I can't understand why self.add_update is useful in that situation.
Are the weights not updated if I don't call self.add_update? If not, why must new_centers be in the updates list and not in the inputs list? And why is x required?
from the source code,
self.add_update(updates, inputs)
updates: update op or list of update ops to add to the layer.
inputs: input tensor or list of inputs tensors to mark the updates as conditional on these inputs. If None is passed, the updates are assumed unconditional.
There are two types of weights:
Trainable = Updated automatically by the optimizer with backpropagation
Untrainable = Not updated by backpropagation
For the trainable weights, it's really not recommended to use updates: you would be mixing the optimizer's updates with your own, and that could cause many issues.
For the untrainable weights, you can do whatever you want. Sometimes you want constants and you will do nothing; sometimes you want these weights to change (but not via backpropagation).
Notice how in that example the weights updated by the user are untrainable:
self.centers = self.add_weight(name='centers',
                               shape=(10, 2),
                               initializer='uniform',
                               #UNTRAINABLE
                               trainable=False)
But the user wants these weights to be updated following some rules. I don't know what they are doing there (didn't analyse the code), but I assume that they are calculating, for instance, something similar to the center point of a group of images, and each batch will have this center in a different position. They want to update this position.
A classical example is the BatchNormalization layer. Besides the trainable scale and bias weights used to rescale the outputs, it has mean and variance weights. These are statistical properties of the data that need to be updated with every batch.
You are not training the "mean" or the "variance", but each batch of data updates these values.
How does it work?
This is obscure and lies deep down in Keras code.
We need the update operation so we make sure self.centers will have new values for every batch, otherwise it won't.
We use self.add_update in a layer to register that this variable should be updated. (We do similar things in custom optimizers as well, the optimizers contain the updates to the weights made via backpropagation)
Later in the source code for training the model, Keras will collect all these registered updates and make a train function. Somewhere inside this, these updates will be applied to the vars:
#inside a training function from keras
with K.name_scope('training'):
    with K.name_scope(self.optimizer.__class__.__name__):
        training_updates = self.optimizer.get_updates(
            params=self._collected_trainable_weights,
            loss=self.total_loss)
    updates = (self.updates + #probably the updates registered in layers
               training_updates + #the updates registered in optimizers
               self.metrics_updates) #don't know....
    # Gets loss and metrics. Updates weights at each call.
    self.train_function = K.function(
        inputs,
        [self.total_loss] + self.metrics_tensors,
        updates=updates,
        name='train_function',
        **self._function_kwargs)
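In recent tf.keras versions the same kind of non-backprop update is usually written by calling assign on a non-trainable weight inside call, rather than through add_update. The sketch below assumes TensorFlow 2.x; the RunningMean layer and its momentum rule are hypothetical stand-ins for the centers update discussed above, in the spirit of BatchNormalization's moving statistics.

```python
import tensorflow as tf

class RunningMean(tf.keras.layers.Layer):
    """Subtracts a running mean that is updated per batch, not by backprop."""

    def __init__(self, momentum=0.9, **kwargs):
        super().__init__(**kwargs)
        self.momentum = momentum

    def build(self, input_shape):
        # Untrainable weight: the optimizer never touches it.
        self.mean = self.add_weight(name="mean",
                                    shape=(int(input_shape[-1]),),
                                    initializer="zeros",
                                    trainable=False)

    def call(self, inputs, training=False):
        if training:
            batch_mean = tf.reduce_mean(inputs, axis=0)
            # Manual update rule, applied only in the training phase.
            self.mean.assign(self.momentum * self.mean
                             + (1.0 - self.momentum) * batch_mean)
        return inputs - self.mean

layer = RunningMean(momentum=0.5)
x = 2.0 * tf.ones((4, 3))
out_train = layer(x, training=True)   # updates the running mean
out_infer = layer(x, training=False)  # leaves it unchanged
```

The weight changes on every training batch even though it never appears in trainable_weights, which is the distinction the answer draws between the two kinds of weights.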

Keras dense layer outputs are 'nan'

I'm using Keras to build an RNN model with CTC loss.
I found that when I passed a tensor to a Dense layer with activation=None, the outputs of this layer were all nan.
But when I set activation='softmax', the outputs were normal, not nan.
problem code (elements of logits are all nan):
logits = Dense(out_shape, activation = None, name="logits")(x_permute)#x_permute is a tensor with shape (?,1876,96)
loss_ctc = Lambda(ctc_lambda_func, name='ctc_my')(
    [logits, labels, x_len, lab_len])
model = Model(inputs=[x, labels, x_len, lab_len], outputs=[loss_ctc])
model.compile(loss={'ctc_my': lambda y_true,y_pred: y_pred}, optimizer='adadelta')
normal code(elements of logits are not nan):
logits = Dense(out_shape, activation = None, name="logits")(x_permute)#x_permute is a tensor with shape (?,1876,96)
output = Activation(activation="softmax", name="softmax")(logits)
loss_ctc = Lambda(ctc_lambda_func, name='ctc_my')(
    [output, labels, x_len, lab_len])
model = Model(inputs=[x, labels, x_len, lab_len], outputs=[loss_ctc])
model.compile(loss={'ctc_my': lambda y_true,y_pred: y_pred}, optimizer='adadelta')
def ctc_lambda_func(args):
    y_pred, y_true, input_length, label_length = args
    return ctc_batch_cost(y_true, y_pred, input_length, label_length)
Can anyone help? Many thanks.
I may have misunderstood you, but why would you want activation=None?
Maybe what you want to use is a linear activation?
Have a look at Keras Activation Functions
as per Klemen Grm
your neural network is completely linear. You might consider different activation functions (eg: tanh, sigmoid, linear) for your hidden and output layers. This both lets you constrain the output range, and will probably improve the learning properties of your network.
In addition to what Klemen says, for the last layer you want a softmax, which normalizes the outputs into probabilities.
Neural networks have to implement complex mapping functions, so they need non-linear activation functions to bring in the non-linearity that enables them to approximate any function. A neuron without an activation function is equivalent to a neuron with a linear activation function.
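The point about linear layers can be checked numerically. The NumPy sketch below (random shapes and values are hypothetical) shows that two stacked layers without an activation collapse into a single linear map, while inserting a tanh between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
W1, b1 = rng.normal(size=(4, 6)), rng.normal(size=6)
W2, b2 = rng.normal(size=(6, 3)), rng.normal(size=3)

# Two layers with no activation (i.e. linear activation).
two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping expressed as ONE linear layer.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b
collapses = np.allclose(two_layers, one_layer)

# With a non-linearity in between, the single layer no longer matches.
with_tanh = np.tanh(x @ W1 + b1) @ W2 + b2
still_linear = np.allclose(with_tanh, one_layer)
```

Here collapses comes out true and still_linear false, which is why hidden layers normally carry a non-linear activation.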