In TensorFlow, what is the best way to add weight decay to a model? - tensorflow

I am trying to add weight decay (aka L2 regularization) to my model. I'm seeing two ways to do this but not sure which one is correct.
Defined in the keras layers as a kernel_regularizer:
model = tf.keras.Sequential(
    [
        tf.keras.Input(shape=input_shape),
        tf.keras.layers.Rescaling(scale=1.0 / 255),
        hub.KerasLayer(HUB_URL, trainable=False),
        tf.keras.layers.Dropout(rate=dropout_rate),
        tf.keras.layers.Dense(
            num_classes,
            activation="softmax",
            kernel_regularizer=tf.keras.regularizers.l2(weight_decay),
        ),
    ]
)
Defined in the optimizer as weight_decay:
tf.keras.optimizers.Adam(learning_rate=learning_rate, weight_decay=weight_decay)
Where should I set the weight decay? Should it be in both?

Related

weighted loss function for multilabel classification

I am working on multilabel classification problem for images. I have 5 classes and I am using sigmoid for the last layer of classification. I have imbalanced data caused by multilabel problem and I thought I can use:
tf.nn.weighted_cross_entropy_with_logits( labels, logits, pos_weight, name=None)
However I don't know how to get logits from my model. I also think I shouldn't use sigmoid in the last layer since this loss function applies sigmoid to the logit.
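First of all, I suggest you have a look at the TensorFlow tutorial for classification on imbalanced datasets. Keep in mind, however, that this tutorial is for binary classification and uses a sigmoid as the activation of the last dense layer. A softmax activation is the usual choice for multi-class classification, where each sample has exactly one label; for multi-label classification, where several classes can be active at once, the outputs are normally kept as independent sigmoids.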
The softmax function normalizes a set of K real numbers into a probability distribution, so that they sum to 1.
For K = 2, softmax is equivalent to the sigmoid function applied to the difference of the two inputs.
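To make the relationship between softmax and sigmoid concrete, here is a minimal pure-Python sketch (no TensorFlow needed). The logit values are arbitrary numbers picked for illustration, and the helper names are my own:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z1, z2, z3 = 1.3, -0.4, 0.7  # arbitrary example logits

# With two classes, the first softmax output equals sigmoid(z1 - z2).
two_class = softmax([z1, z2])
print(two_class[0], sigmoid(z1 - z2))

# Softmax outputs always sum to 1; independent sigmoids do not have to.
softmax_total = sum(softmax([z1, z2, z3]))
sigmoid_total = sigmoid(z1) + sigmoid(z2) + sigmoid(z3)
print(softmax_total, sigmoid_total)
```

The second print is the part that matters for imbalanced multi-label data: independent sigmoids can assign a high score to several classes at once, while softmax forces the classes to compete.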
I don't know your model, but you could create something like this (following the tutorial):
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=None)
])
To obtain the predictions you could do:
predictions = model(x_train[:1]).numpy() # obtains the prediction logits
tf.nn.softmax(predictions).numpy() # converts the logits to probabilities
In order to train you can define the following loss, compile the model, and train:
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
Now, since you have an imbalanced dataset, you will want to add weights. If you look at the documentation of SparseCategoricalCrossentropy, you can see that the __call__ method has an optional parameter sample_weight:
Optional sample_weight acts as a coefficient for the loss. If a scalar
is provided, then the loss is simply scaled by the given value. If
sample_weight is a tensor of size [batch_size], then the total loss
for each sample of the batch is rescaled by the corresponding element
in the sample_weight vector.
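As a rough illustration of what the quoted paragraph describes, the following pure-Python sketch rescales each per-sample loss by its weight. It is not the Keras implementation: the function names are hypothetical, and averaging over the batch afterwards is an assumption about the reduction step.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy of one sample: -log softmax(logits)[label]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def weighted_batch_loss(batch_logits, labels, sample_weight):
    """Rescale each per-sample loss by its weight, then average over the batch."""
    per_sample = [sparse_ce_from_logits(l, y) for l, y in zip(batch_logits, labels)]
    weighted = [w * loss for w, loss in zip(sample_weight, per_sample)]
    return sum(weighted) / len(weighted)

batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
labels = [0, 1]  # the second sample is badly misclassified

uniform = weighted_batch_loss(batch_logits, labels, [1.0, 1.0])
upweighted = weighted_batch_loss(batch_logits, labels, [1.0, 5.0])
doubled = weighted_batch_loss(batch_logits, labels, [2.0, 2.0])
print(uniform, upweighted, doubled)
```

For an imbalanced dataset, the weight vector would typically give samples from rare classes a larger value, so their mistakes contribute more to the total loss.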
I suggest you have a look at this answer if you have doubts on how to proceed. I think it answers perfectly what you want to achieve.
Also I find that this tutorial explains pretty well the multi-label classification problem.
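For the tf.nn.weighted_cross_entropy_with_logits function mentioned in the question, the documented per-element formula is labels * -log(sigmoid(logits)) * pos_weight + (1 - labels) * -log(1 - sigmoid(logits)). The sketch below evaluates that formula in plain Python for one image with 5 labels. The logits and the per-class pos_weight values are made-up numbers, and this simple form is not the numerically stable one a library would use:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_ce_with_logits(label, logit, pos_weight):
    """Per-element loss: the positive term is scaled by pos_weight, the negative term is not."""
    p = sigmoid(logit)
    return -pos_weight * label * math.log(p) - (1.0 - label) * math.log(1.0 - p)

# Raw logits as produced by a last Dense layer with activation=None.
labels = [1.0, 0.0, 0.0, 1.0, 0.0]
logits = [0.8, -1.2, 0.3, -0.5, -2.0]
pos_weight = [1.0, 1.0, 4.0, 4.0, 1.0]  # hypothetical: classes 2 and 3 are rare

losses = [weighted_ce_with_logits(y, z, w)
          for y, z, w in zip(labels, logits, pos_weight)]
print(losses)
```

Note that pos_weight only changes the loss where the label is 1, so a missed positive of a rare class costs more while negatives are unaffected. Since the sigmoid is applied inside the loss, the model's last layer should output raw logits, as the question suspects.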

cost function after converting tf.layers to tf.keras.layers

I have a CNN where output dimension is [None, 10]
It is a multi-label problem, where the output signifies the possible categories to which x might belong (e.g., an image can be classified as cat, dark, and so on).
Following is what I have now. How can I change the code to the Keras version?
I can't find an equivalent of sigmoid_cross_entropy_with_logits.
model = tf.layers.dense(L3, category_num, activation=None)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits=model, labels=Y)
cost = tf.reduce_mean(tf.reduce_sum(cross_entropy, axis=1))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
The direct alternative in Keras is to use a sigmoid activation in your output layer and binary_crossentropy as the cost function.
net.add(Dense(..., activation='sigmoid'))
net.compile(optimizer, loss='binary_crossentropy')
Take a look at https://github.com/keras-team/keras/issues/741
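To see why a sigmoid output plus binary cross-entropy matches the original sigmoid_cross_entropy_with_logits code, here is a small pure-Python check. It assumes the numerically stable form max(x, 0) - x*z + log(1 + exp(-|x|)) for the logits version, and it assumes the Keras loss averages over the categories where the original cost sums over them (axis=1); the data values are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_ce_with_logits(logit, label):
    """Stable form: max(x, 0) - x*z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def binary_ce(prob, label):
    """Cross-entropy on a probability, as used after a sigmoid activation."""
    return -label * math.log(prob) - (1.0 - label) * math.log(1.0 - prob)

batch_logits = [[1.5, -0.3, 0.0], [-2.0, 0.7, 0.4]]
batch_labels = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
num_categories = 3

from_logits = [[sigmoid_ce_with_logits(x, z) for x, z in zip(xs, zs)]
               for xs, zs in zip(batch_logits, batch_labels)]
from_probs = [[binary_ce(sigmoid(x), z) for x, z in zip(xs, zs)]
              for xs, zs in zip(batch_logits, batch_labels)]

# Original cost: sum over categories, then mean over the batch.
sum_then_mean = sum(sum(row) for row in from_logits) / len(from_logits)
# Mean over categories, then mean over the batch.
mean_then_mean = sum(sum(row) / len(row) for row in from_probs) / len(from_probs)
print(sum_then_mean, mean_then_mean * num_categories)
```

Under those assumptions the two costs differ only by the constant factor num_categories, which rescales the gradients but does not move the minimum.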
In Keras:
# your model here -- last layer:
model.add(Dense(10))
model.add(Activation('sigmoid'))
# binary_crossentropy pairs with the per-label sigmoid outputs of a multi-label model
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])

keras add external trainable variable to graph

I am working on language modelling and the vocabulary is large, so I want to use sampled_softmax_loss from TensorFlow. The problem is that the weights and biases, which are arguments of the sampled_softmax_loss function, do not seem to be trainable (their values don't change after training).
So I guess that I should add them to the computation graph that the Keras Model builds automatically, but I have spent a lot of time and still haven't found a proper way to do so.
So, once again: I want to add external trainable tf.Variables to the Keras computation graph. Does anyone know how to do this?
my model (head and tail)
input_sentence = Input(shape=(INPUT_LENGTH,), dtype='int32')
words = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                  weights=[embedding_matrix], trainable=True)(input_sentence)
...
context = Dense(256, activation='tanh')(context)
model = Model(inputs=input_sentence, outputs=context, name=name)
loss
def softmax_fine_loss(labels, logits, transposed_W=None, b=None):
    # Python 3 has no tuple-unpacking lambdas, so index the (labels, logits) pair
    res = tf.map_fn(
        lambda pair: tf.nn.sampled_softmax_loss(
            transposed_W, b, pair[0], pair[1],
            num_sampled=1000, num_classes=OUTPUT_COUNT + 1),
        (labels, logits), dtype=tf.float32)
    return res
loss = lambda labels, logits: softmax_fine_loss(labels, logits, transposed_W=transposed_W, b=b)
model_truncated.compile(optimizer=optimizer, loss=loss, sample_weight_mode='temporal')
I have finally found a workaround.
Let's say we need to train weights W and biases b with our model.
The workaround is simply to add them to one of the trainable layers of our model:
model.layers[-1].trainable_weights.extend([W, b])
Then we can compile the model:
model.compile(...)
It is extremely important to add the variables to a trainable layer. For example, I experimented with a Sequential model, and adding [W, b] to the Activation layer did not actually make them trainable.

tensorflow tutorial of convolution, scale of logit

I am trying to edit my own model by adding some code to cifar10.py and here is the question.
In cifar10.py, the tutorial says:
EXERCISE: The output of inference are un-normalized logits. Try editing the network architecture to return normalized predictions using tf.nn.softmax().
So I directly input the output from "local4" to tf.nn.softmax(). This gives me the scaled logits which means the sum of all logits is 1.
But in the loss function, the cifar10.py code uses:
tf.nn.sparse_softmax_cross_entropy_with_logits()
and description of this function says
WARNING: This op expects unscaled logits, since it performs a softmax on logits internally for efficiency. Do not call this op with the output of softmax, as it will produce incorrect results.
Also, according to the description, the logits passed to the above function must have the shape [batch_size, num_classes], and they should be unscaled, i.e. the values before softmax, as in the sample code, which computes the unnormalized logits as follows.
# softmax, i.e. softmax(WX + b)
with tf.variable_scope('softmax_linear') as scope:
    weights = _variable_with_weight_decay('weights', [192, NUM_CLASSES],
                                          stddev=1/192.0, wd=0.0)
    biases = _variable_on_cpu('biases', [NUM_CLASSES],
                              tf.constant_initializer(0.0))
    softmax_linear = tf.add(tf.matmul(local4, weights), biases, name=scope.name)
    _activation_summary(softmax_linear)
Does this mean I don't have to use tf.nn.softmax in the code?
You can use tf.nn.softmax in the code if you want, but then you will have to compute the loss yourself:
probs = tf.nn.softmax(logits)  # normalized class probabilities, no longer logits
# labels must be one-hot here, e.g. tf.one_hot(sparse_labels, NUM_CLASSES)
loss = tf.reduce_mean(-tf.reduce_sum(labels * tf.log(probs), axis=1))
In practice, you don't use tf.nn.softmax for computing the loss. However you need to use tf.nn.softmax if for instance you want to compute the predictions of your algorithm and compare them to the true labels (to compute accuracy).
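The warning quoted in the question can be reproduced with a few lines of plain Python. This is only a sketch of what a "with_logits" loss does internally (softmax, then the negative log of the true class probability); the logit values are invented for the example:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cross_entropy(logits, label):
    """Softmax is applied inside the loss, then -log p[label] is returned."""
    return -math.log(softmax(logits)[label])

logits = [4.0, 1.0, -2.0]  # unscaled scores; class 0 is clearly preferred
label = 0

correct = softmax_cross_entropy(logits, label)
# Mistake: passing probabilities where logits are expected applies softmax twice.
double = softmax_cross_entropy(softmax(logits), label)
print(correct, double)

# The predicted class is the same either way, because softmax preserves ordering.
probs = softmax(logits)
same_argmax = logits.index(max(logits)) == probs.index(max(probs))
print(same_argmax)
```

Because probabilities are squeezed into [0, 1], the second softmax flattens them, and the loss stays large even for a confident, correct prediction. The last check also shows that the argmax of the logits already gives the predicted class; softmax is only needed when the probabilities themselves are wanted.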

How to define weight decay for individual layers in TensorFlow?

In CUDA ConvNet, we can write something like this (source) for each layer:
[conv32]
epsW=0.001
epsB=0.002
momW=0.9
momB=0.9
wc=0
where wc=0 refers to the L2 weight decay.
How can the same be achieved in TensorFlow?
You can add all the variables you want to apply weight decay to to a collection named 'weights', and then calculate the L2 weight decay for the whole collection.
# Create your variables; keep them in the default global collection as well,
# so that the usual initializer still sees them
weights = tf.get_variable('weights', collections=[tf.GraphKeys.GLOBAL_VARIABLES, 'weights'])
with tf.variable_scope('weights_norm') as scope:
    weights_norm = tf.reduce_sum(
        input_tensor=WEIGHT_DECAY_FACTOR * tf.stack(  # tf.pack was renamed to tf.stack in TF 1.0
            [tf.nn.l2_loss(i) for i in tf.get_collection('weights')]
        ),
        name='weights_norm'
    )
# Add the weight decay loss to another collection called losses
tf.add_to_collection('losses', weights_norm)
# Add the other loss components to the collection losses
# ...
# To calculate your total loss
tf.add_n(tf.get_collection('losses'), name='total_loss')
get_variable(
    name,
    shape=None,
    dtype=None,
    initializer=None,
    regularizer=None,
    trainable=True,
    collections=None,
    caching_device=None,
    partitioner=None,
    validate_shape=True,
    use_resource=None,
    custom_getter=None)
This is the signature of the TensorFlow function get_variable. You can easily specify a regularizer to apply weight decay.
Following is an example:
weight_decay = tf.constant(0.0005, dtype=tf.float32) # your weight decay rate, must be a scalar tensor.
W = tf.get_variable(name='weight', shape=[4, 4, 256, 512], regularizer=tf.contrib.layers.l2_regularizer(weight_decay))
Both current answers are wrong in that they do not give you "weight decay as in cuda-convnet" but instead L2-regularization, which is different.
When using pure SGD (without momentum) as an optimizer, weight decay is the same thing as adding an L2-regularization term to the loss. When using any other optimizer, this is not true.
Weight decay (don't know how to TeX here, so excuse my pseudo-notation), where dw is the gradient of the actual loss:
w[t+1] = w[t] - learning_rate * dw - weight_decay * w[t]
L2-regularization:
loss = actual_loss + lambda * 1/2 * sum(||w||_2^2 for w in network_params)
Computing the gradient of the extra term in L2-regularization gives lambda * w, and thus inserting it into the SGD update equation
dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dloss_dw
gives the same as weight decay with weight_decay = learning_rate * lambda, i.e. it mixes lambda with the learning_rate. Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2-regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%", on page 10.)
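The difference can be checked numerically on a toy problem. The sketch below is plain Python on a single scalar weight; the quadratic loss, the hyperparameter values and the function names are all assumptions chosen for illustration, and the momentum rule is the simple "heavy ball" form rather than any particular library's optimizer.

```python
def grad(w):
    """Gradient of a toy loss 0.5 * (w - 3)^2."""
    return w - 3.0

def sgd_l2(w, lr, lam, steps):
    for _ in range(steps):
        w = w - lr * (grad(w) + lam * w)   # L2 term enters through the gradient
    return w

def sgd_decay(w, lr, wd, steps):
    for _ in range(steps):
        w = w - lr * grad(w) - wd * w      # decay is applied directly to w
    return w

def momentum_l2(w, lr, lam, mu, steps):
    v = 0.0
    for _ in range(steps):
        v = mu * v + grad(w) + lam * w     # the L2 term is accumulated in v
        w = w - lr * v
    return w

def momentum_decay(w, lr, wd, mu, steps):
    v = 0.0
    for _ in range(steps):
        v = mu * v + grad(w)               # momentum sees only the loss gradient
        w = w - lr * v - wd * w
    return w

lr, lam, mu, steps = 0.1, 0.01, 0.9, 500
wd = lr * lam  # the coupling under which plain SGD makes both rules coincide

plain_l2 = sgd_l2(1.0, lr, lam, steps)
plain_decay = sgd_decay(1.0, lr, wd, steps)
mom_l2 = momentum_l2(1.0, lr, lam, mu, steps)
mom_decay = momentum_decay(1.0, lr, wd, mu, steps)
print(plain_l2, plain_decay)  # agree
print(mom_l2, mom_decay)      # end up at different values
```

With plain SGD the two runs agree up to rounding. With momentum they settle at different points, because the L2 term gets amplified by the momentum buffer while the decoupled decay does not.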
That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of above paper.
One possible way to implement it is by writing an op that does the decay step manually after every optimizer step. A different way, which is what I'm currently doing, is using an additional SGD optimizer just for the weight decay, and "attaching" it to your train_op. Both of these are just crude work-arounds, though. My current code:
# In the network definition:
with arg_scope([layers.conv2d, layers.dense],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    ...  # define the network.
loss = ...  # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    with tf.control_dependencies([train_op]):
        # Plain SGD with learning_rate=1.0 on the summed L2 terms shrinks each weight by weight_decay * w
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))
This somewhat makes use of TensorFlow's provided bookkeeping. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph-key, which I then all sum up and optimize using SGD which, as shown above, corresponds to actual weight-decay.
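The arithmetic behind the extra SGD step can be sketched without TensorFlow. This assumes the regularizer contributes weight_decay * 0.5 * sum(w^2) per variable, so its gradient is weight_decay * w; the weight values and helper names are invented for the example.

```python
def l2_reg_grad(weights, weight_decay):
    """Gradient of weight_decay * 0.5 * sum(w^2) with respect to each weight."""
    return [weight_decay * w for w in weights]

def sgd_step(weights, grads, learning_rate):
    return [w - learning_rate * g for w, g in zip(weights, grads)]

weight_decay = 0.004                 # shrink every weight by 0.4% per step
after_main_step = [0.5, -2.0, 1.25]  # weights after the main optimizer update

# A plain SGD step with learning_rate=1.0 on the regularization loss alone.
decayed = sgd_step(after_main_step,
                   l2_reg_grad(after_main_step, weight_decay),
                   learning_rate=1.0)
print(decayed)
```

Each weight comes out multiplied by (1 - weight_decay), independent of the main optimizer's learning rate or internal state, which is the decoupled behaviour the work-around is after.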
Hope that helps, and if anyone gets a nicer code snippet for this, or TensorFlow implements it better (i.e. in the optimizers), please share.
Edit: see also this PR which just got merged into TF.