How can I use TensorFlow's sampled softmax loss function in a Keras model? - tensorflow

I'm training a language model in Keras and would like to speed up training by using sampled softmax as the final activation function in my network. From the TF docs, it looks like I need to supply arguments for weights and biases, but I'm unsure of what is expected as input for these. It seems like I could write a custom function in Keras as follows:
import keras.backend as K

def sampled_softmax(weights, biases, y_true, y_pred, num_sampled, num_classes):
    return K.sampled_softmax(weights, biases, y_true, y_pred, num_sampled, num_classes)
However, I'm unsure of how to "plug this in" to my existing network. The architecture for the LM is pretty dead-simple:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim=len(vocab), output_dim=256))
model.add(LSTM(1024, return_sequences=True))
model.add(Dense(output_dim=len(vocab), activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
Given this architecture, could I pass the sampled_softmax function as the loss argument when calling the compile method on the model? Or does this need to be written as a layer that comes after the final fully-connected layer? Any guidance here would be greatly appreciated. Thanks.

The key observation here is that the TensorFlow sampled softmax function returns actual losses, not a set of predictions over the possible labels to compare with the ground truth and then compute losses as a separate step. This makes the model setup a little bit weird.
First, we add a second input layer to the model that encodes the target (training) data a second time as an input, in addition to being the target output. This is used for the labels argument of the sampled_softmax_loss function. It needs to be a Keras input, because it's treated as an input when we go to instantiate and set up the model.
Second, we construct a new custom Keras layer that calls the sampled_softmax_loss function with two Keras layers as its inputs: the output of the dense layer that predicts our classes, and then the second input that contains a copy of the training data. Note that we're doing some serious hackery accessing the _keras_history instance variable to fetch the weight and bias tensors from the output tensor of the original fully-connected layer.
Finally, we have to construct a new "dumb" loss function that ignores the training data and just uses the loss reported by the sampled_softmax_loss function.
Note that because the sampled softmax function returns losses, not class predictions, you can't use this model specification for validation or inference. You'll need to reuse the trained layers from this "training version" in a new specification that applies a standard softmax to the original dense layer (which keeps its default linear activation).
There is definitely a more elegant way to do this, but I believe this works, so I figured I'd post it here now as-is rather than wait until I have something that's a little bit neater. For example, you'd probably want to make the number of classes an argument of the SampledSoftmax layer, or better yet, condense this all into the loss function as in the original question and avoid passing in the training data twice.
from keras.models import Model
from keras.layers import Input, Dense, Layer
from keras import backend as K

class SampledSoftmax(Layer):
    def __init__(self, **kwargs):
        super(SampledSoftmax, self).__init__(**kwargs)

    def call(self, inputs):
        """
        The first input should be the model as it were, and the second the
        target (i.e., a repeat of the training data) to compute the labels
        argument.
        """
        # the labels input to this function is batch size by 1, where the
        # value at position (i, 1) is the index that is true (not zero)
        # e.g., (0, 0, 1) => (2) or (0, 1, 0, 0) => (1)
        return K.tf.nn.sampled_softmax_loss(
            weights=inputs[0]._keras_history[0].weights[0],
            biases=inputs[0]._keras_history[0].bias,
            inputs=inputs[0],
            labels=K.tf.reshape(K.tf.argmax(inputs[1], 1), [-1, 1]),
            num_sampled=1000,
            num_classes=200000)

def custom_loss(y_true, y_pred):
    return K.tf.reduce_mean(y_pred)

num_classes = 200000
inputs = Input(shape=(300,))
target_input = Input(shape=(num_classes,))
dense = Dense(num_classes)
outputs = dense(inputs)
outputs = SampledSoftmax()([outputs, target_input])

model = Model([inputs, target_input], outputs)
model.compile(optimizer='adam', loss=custom_loss)
# train as desired
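As a hedged sketch of the inference-time counterpart described above (not part of the original training code), the trained dense layer can be reused in a standard model with a full softmax:
from keras.layers import Activation

# Reuse the trained `dense` layer from the training model above, but apply
# a full softmax over all classes for validation/inference.
inference_input = Input(shape=(300,))
logits = dense(inference_input)  # shares the trained weights
predictions = Activation('softmax')(logits)
inference_model = Model(inference_input, predictions)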

Related

Using training weights on a non-training data to design a new loss function

I would like to access the training point(s) at a training iteration and incorporate a soft constraint into my loss function by using data points not included in the training set. I will use this post as a reference.
import numpy as np
import keras.backend as K
from keras.layers import Dense, Input
from keras.models import Model

# Some random training data and labels
features = np.random.rand(100, 5)
labels = np.random.rand(100, 2)

# Simple neural net with three outputs
input_layer = Input((20,))
hidden_layer = Dense(16)(input_layer)
output_layer = Dense(3)(hidden_layer)

# Model
model = Model(inputs=input_layer, outputs=output_layer)

# Each training point has another data pair. In the real example, I will
# have multiple supporters. That is why I am using a dict.
holder = np.random.rand(100, 5)
iter = np.arange(start=1, stop=features.shape[0], step=1)
supporters = {}
for i, j in zip(iter, holder):  # i represents the ith training data point
    supporters[i] = j

# Write a custom loss function
def custom_loss(y_true, y_pred):
    # Normal MSE loss
    mse = K.mean(K.square(y_true - y_pred), axis=-1)
    new_constraint = ....
    return mse + new_constraint

model.compile(loss=custom_loss, optimizer='sgd')
model.fit(features, labels, epochs=1, batch_size=1)
For simplicity, let us assume that I'd like to minimize the minimum absolute value difference between the prediction value and the prediction of the pair data stored in supporters by using the fixed network weights. Also, assume that I pass one training point at each batch. However, I could not figure out how to perform this operation. I've tried something shown below, but clearly, it is not correct.
new_constraint = K.sum(y_pred - model.fit(supporters))
Fit is the procedure of training the model, not of evaluating it. I think it would be better for your problem to load a new instance of your model with your current weights and evaluate the batch loss in order to calculate the loss of the main model.
import tensorflow as tf
import keras.backend as K
from keras.models import Model

main_model = Model()  # This is your main training model

def custom_loss_1(y_true, y_pred):  # Avoid recursive calls
    mse = K.mean(K.square(y_true - y_pred), axis=-1)
    return mse

def custom_loss(y_true, y_pred):
    support_model = tf.keras.models.clone_model(main_model)  # You copy the main model but the weights are uninitialized
    support_model.build((20,))  # You build with inputs the same as your support data
    support_model.compile(loss=custom_loss_1, optimizer='sgd')
    support_model.set_weights(main_model.get_weights())  # You load the weights of the main model
    mse = custom_loss_1(y_true, y_pred)
    # You just want to evaluate the model, not to train it. If you have more
    # metrics than just the loss, use support_model.evaluate(supporters)[0]
    new_constraint = K.sum(y_pred - support_model.predict(supporters))  # predict to get the output, evaluate to get the metrics
    return mse + new_constraint
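A hypothetical usage sketch, assuming the features and labels arrays from the question and a TF 2.x runtime:
# Hypothetical usage; run_eagerly=True is assumed so the Python-side
# predict() call inside custom_loss can execute per batch.
main_model.compile(loss=custom_loss, optimizer='sgd', run_eagerly=True)
main_model.fit(features, labels, epochs=1, batch_size=1)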

How to build a Neural Network in Keras using a custom loss function with datapoint-specific weight?

I want to train a Neural Network for a classification task in Keras using a TensorFlow backend with a custom loss function. In my loss, I want to give different weights to different training examples. I have some datapoints I consider important and some I do not consider as important. I want my loss function to take this into account and punish errors in important examples more than in less important ones.
I have already built my model:
input = tf.keras.Input(shape=(16,))
hidden_layer_1 = tf.keras.layers.Dense(5, kernel_initializer='glorot_uniform', activation='relu')(input)
output = tf.keras.layers.Dense(1, kernel_initializer='normal', activation='softmax')(hidden_layer_1)
model = tf.keras.Model(input, output)
model.compile(loss=custom_loss(input), optimizer='adam', run_eagerly=True, metrics = [tf.keras.metrics.Accuracy(), 'acc'])
and the current state of my loss function is:
def custom_loss(input):
    def loss(y_true, y_pred):
        return ...
    return loss
I'm struggling with implementing the loss function in the way I explained above, mainly because I don't exactly know what input, y_pred and y_true are (KerasTensors, I know - but what is the content? And is it for one training example only or for the whole batch?). I'd appreciate help with:
1. printing out the values of input, y_true and y_pred
2. converting the input value to a numpy ndarray ([1,3,7] for example) so I can use the array to look up my weight for this specific training data point
3. once I have my weight as a number (0.5 for example), how do I implement the computation of the loss function in Keras? My loss for one training example should be 0 if the classification was correct and the weight if it was incorrect.
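A minimal sketch of one common approach, assuming the per-example weight is packed into y_true next to the label (all names here are illustrative, not from the question):
import tensorflow as tf

# Sketch: y_true is assumed to carry [label, weight] per example, so the
# loss can look up each data point's weight without external state.
def weighted_loss(y_true, y_pred):
    label = y_true[:, 0:1]   # true class (0 or 1)
    weight = y_true[:, 1:2]  # importance weight for this example
    # crossentropy as a differentiable stand-in for the hard 0/weight loss,
    # which has no useful gradient on its own
    bce = tf.keras.losses.binary_crossentropy(label, y_pred)
    return tf.reduce_mean(tf.squeeze(weight, axis=-1) * bce)
Note that if the weights are known up front for each training example, the same effect is available without a custom loss via model.fit(..., sample_weight=weights_array).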

Keras remove activation function of last layer

I want to use ResNet50 with Imagenet weights.
The last layer of ResNet50 is (from here)
x = layers.Dense(1000, activation='softmax', name='fc1000')(x)
I need to keep the weights of this layer but remove the softmax function.
I want to manually change it so my last layer looks like this
x = layers.Dense(1000, name='fc1000')(x)
but the weights stay the same.
Currently I call my net like this
resnet = Sequential([
    Input(shape=(224,224,3)),
    ResNet50(weights='imagenet', input_shape=(224,224,3))
])
I need the Input layer because otherwise the model.compile says that placeholders aren't filled.
Generally there are two ways of achieving this:
Quick way - supported functions:
To change the final layer's activation function, you can pass the argument classifier_activation.
So in order to get rid of the activation altogether, the model can be instantiated like:
import tensorflow as tf

resnet = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224,224,3)),
    tf.keras.applications.ResNet50(
        weights='imagenet',
        input_shape=(224,224,3),
        pooling="avg",
        classifier_activation=None
    )
])
This, however, is not going to work if you want a different function that is not supported by the Keras classifier_activation parameter (e.g. a custom activation function).
To achieve this you can use the workaround solution:
Long way - copy the model's weights
This solution proposes copying the original model's weights onto your custom one. This approach works because, apart from the activation function, you are not changing the model's architecture.
You need to:
1. Download original model.
2. Save its weights.
3. Declare your modified version of the model (in your case, without the activation function).
4. Set the weights of the new model.
The snippet below explains this concept in more detail:
import tensorflow as tf

# 1. Download original resnet
resnet = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224,224,3)),
    tf.keras.applications.ResNet50(
        weights='imagenet',
        input_shape=(224,224,3),
        pooling="avg"
    )
])

# 2. Hold the weights in memory:
imagenet_weights = resnet.get_weights()

# 3. Declare the model, but without softmax
resnet_no_softmax = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224,224,3)),
    tf.keras.applications.ResNet50(
        include_top=False,
        weights='imagenet',
        input_shape=(224,224,3),
        pooling="avg"
    ),
    tf.keras.layers.Dense(1000, name='fc1000')
])

# 4. Pass the imagenet weights onto the second resnet
resnet_no_softmax.set_weights(imagenet_weights)
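As a hedged sanity check (not part of the original answer), the two models should agree up to the removed softmax:
import numpy as np

# The copied weights should make the models agree up to the softmax,
# within numerical precision.
x = np.random.rand(1, 224, 224, 3).astype("float32")
probs = resnet.predict(x)
logits = resnet_no_softmax.predict(x)
np.testing.assert_allclose(probs, tf.nn.softmax(logits).numpy(), atol=1e-4)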
Hope this helps!

Tensorflow 2.0 Custom loss function with multiple inputs

I am trying to optimize a model with the following two loss functions
from tensorflow.keras import losses as kls

def loss_1(pred, weights, logits):
    weighted_sparse_ce = kls.SparseCategoricalCrossentropy(from_logits=True)
    policy_loss = weighted_sparse_ce(pred, logits, sample_weight=weights)
    return policy_loss
and
def loss_2(y_pred, y):
    return kls.mean_squared_error(y_pred, y)
however, because TensorFlow 2 expects a loss function to be of the form
def fn(y_true, y_pred):
    ...
I am using a work-around for loss_1 where I pack pred and weights into a single tensor before passing to loss_1 in the call to model.fit and then unpack them in loss_1. This is inelegant and nasty because pred and weights are of different data types and so this requires an additional cast, pack, un-pack and un-cast each time I call model.fit.
Furthermore, I am aware of the sample_weight argument to fit, which is kind of like the solution to this question. This might be a workable solution were it not for the fact that I am using two loss functions and I only want the sample_weight applied to one of them. Also, even if this were a solution, it would not be generalizable to other types of custom loss functions.
All that being said, my question, said concisely, is:
What is the best way to create a loss function with an arbitrary number of
arguments in TensorFlow 2?
Another thing I have tried is passing a tf.tuple but that also seems to violate TensorFlow's desires for a loss function input.
This problem can be easily solved using custom training in TF2. You need only compute your two-component loss function within a GradientTape context and then call an optimizer with the produced gradients. For example, you could create a function custom_loss which computes both losses given the arguments to each:
def custom_loss(model, loss1_args, loss2_args):
    # model: tf.keras.Model
    # loss1_args: arguments to loss_1, as a tuple.
    # loss2_args: arguments to loss_2, as a tuple.
    with tf.GradientTape() as tape:
        l1_value = loss_1(*loss1_args)
        l2_value = loss_2(*loss2_args)
        loss_value = [l1_value, l2_value]
    return loss_value, tape.gradient(loss_value, model.trainable_variables)

# In the training loop:
loss_values, grads = custom_loss(model, loss1_args, loss2_args)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
In this way, each loss function can take an arbitrary number of eager tensors, regardless of whether they are inputs or outputs to the model. The sets of arguments to each loss function need not be disjoint as shown in this example.
To expand on Jon's answer. In case you want to still have the benefits of a Keras Model you can expand the model class and write your own custom train_step:
import tensorflow as tf
from tensorflow import keras
from tensorflow.python.keras.engine import data_adapter

# custom loss function that takes two outputs of the model
# as input parameters, which would otherwise not be possible
def custom_loss(gt, x, y):
    return tf.reduce_mean(x) + tf.reduce_mean(y)

class CustomModel(keras.Model):
    def compile(self, optimizer, my_loss):
        super().compile(optimizer)
        self.my_loss = my_loss

    def train_step(self, data):
        data = data_adapter.expand_1d(data)
        input_data, gt, sample_weight = data_adapter.unpack_x_y_sample_weight(data)
        with tf.GradientTape() as tape:
            y_pred = self(input_data, training=True)
            loss_value = self.my_loss(gt, y_pred[0], y_pred[1])
        grads = tape.gradient(loss_value, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        return {"loss_value": loss_value}

...

model = CustomModel(inputs=input_tensor0, outputs=[x, y])
model.compile(optimizer=tf.keras.optimizers.Adam(), my_loss=custom_loss)
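Hypothetical usage, assuming input and ground-truth arrays shaped to match input_tensor0 and the gt argument above (input_array and gt_array are illustrative names):
# fit() now routes every batch through the custom train_step, so the
# second argument is simply whatever custom_loss expects as gt.
model.fit(input_array, gt_array, epochs=5, batch_size=32)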
In tf 1.x we have the tf.nn.weighted_cross_entropy_with_logits function, which allows us to trade off recall and precision by adding extra positive weights for each class. In multi-label classification, it should be a (N,) tensor or numpy array. However, in tf 2.0 I haven't found a similar loss function yet, so I wrote my own loss function with an extra argument pos_w_arr.
import tensorflow as tf
from tensorflow.keras.backend import epsilon

def pos_w_loss(pos_w_arr):
    """
    Define positive weighted loss function
    """
    def fn(y_true, y_pred):
        _epsilon = tf.convert_to_tensor(epsilon(), dtype=y_pred.dtype.base_dtype)
        _y_pred = tf.clip_by_value(y_pred, _epsilon, 1. - _epsilon)
        cost = (tf.multiply(tf.multiply(y_true, tf.math.log(_y_pred)), pos_w_arr)
                + tf.multiply((1 - y_true), tf.math.log(1 - _y_pred)))
        return -tf.reduce_mean(cost)
    return fn
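A hypothetical usage sketch of the closure above, for a three-label problem:
import tensorflow as tf

# Illustrative weights: one positive weight per label in a 3-label task
pos_w_arr = tf.constant([1.0, 2.5, 0.8])
model.compile(optimizer='adam', loss=pos_w_loss(pos_w_arr))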
Not sure what you mean by it not working when using eager tensors or numpy arrays as inputs, though. Please correct me if I'm wrong.

Keras dense layer outputs are 'nan'

I'm using Keras to build an RNN model with CTC loss.
I found that when I passed a tensor to a Dense layer with activation=None, the outputs of this layer were all nan.
But when I set activation='softmax', the outputs were normal, not nan.
problem code (elements of logits are all nan):
logits = Dense(out_shape, activation=None, name="logits")(x_permute)  # x_permute is a tensor with shape (?, 1876, 96)
loss_ctc = Lambda(ctc_lambda_func, name='ctc_my')(
    [logits, labels, x_len, lab_len])
model = Model(inputs=[x, labels, x_len, lab_len], outputs=[loss_ctc])
model.compile(loss={'ctc_my': lambda y_true, y_pred: y_pred}, optimizer='adadelta')
normal code (elements of logits are not nan):
logits = Dense(out_shape, activation=None, name="logits")(x_permute)  # x_permute is a tensor with shape (?, 1876, 96)
output = Activation(activation="softmax", name="softmax")(logits)
loss_ctc = Lambda(ctc_lambda_func, name='ctc_my')(
    [output, labels, x_len, lab_len])
model = Model(inputs=[x, labels, x_len, lab_len], outputs=[loss_ctc])
model.compile(loss={'ctc_my': lambda y_true, y_pred: y_pred}, optimizer='adadelta')

def ctc_lambda_func(args):
    y_pred, y_true, input_length, label_length = args
    return ctc_batch_cost(y_true, y_pred, input_length, label_length)
Can anyone help? Many thanks.
I may misunderstand you, but why would you want activation=None?
Maybe what you want to use is linear activation?
Have a look at Keras Activation Functions
as per Klemen Grm
your neural network is completely linear. You might consider different activation functions (eg: tanh, sigmoid, linear) for your hidden and output layers. This both lets you constrain the output range, and will probably improve the learning properties of your network.
In addition to what Klemen says, for the last layer you want a softmax that normalizes the outputs into probabilities.
Neural networks have to implement complex mapping functions, hence they need non-linear activation functions to bring in the much-needed non-linearity that enables them to approximate any function. A neuron without an activation function is equivalent to a neuron with a linear activation function.
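To make that concrete for the question above: K.ctc_batch_cost takes the log of y_pred internally, so it expects probabilities; raw (possibly negative) logits are what produce the nan. A minimal sketch, assuming out_shape and x_permute from the question:
from keras.layers import Dense, Activation

# activation=None and activation='linear' are equivalent (identity) in Keras.
# Normalize with a softmax before the CTC loss instead of feeding raw logits.
logits = Dense(out_shape, activation='linear', name='logits')(x_permute)
probs = Activation('softmax', name='softmax')(logits)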