I want to calculate gradients for both trainable and non-trainable variables, and update only the trainable parameters.
At first, I implemented it as follows:
with tf.GradientTape(persistent=True) as g:
    preds = model(data)
    loss = criterion(labels, preds)

gradients = g.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
non_train_gradients = g.gradient(loss, model.non_trainable_variables)
However, the above code runs backpropagation twice to compute the gradients.
I want to estimate the gradients of both trainable and non-trainable variables simultaneously, but update only the trainable parameters.
How can I do it?
We can use the fact that the gradients are just a list and are returned in the same order as the variables we put in:
n_trainable = len(model.trainable_variables)
gradients = g.gradient(
    loss, model.trainable_variables + model.non_trainable_variables
)
trainable_gradients = gradients[:n_trainable]
non_trainable_gradients = gradients[n_trainable:]
optimizer.apply_gradients(
    zip(trainable_gradients, model.trainable_variables)
)
That is, we just put all the non-trainable variables at the end, and then split the gradients at that point.
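Putting it together, a full training step could look like this (just a sketch reusing model, criterion and optimizer from the question; note that a GradientTape only watches trainable variables automatically, so the non-trainable ones have to be watched explicitly):

import tensorflow as tf

@tf.function
def train_step(data, labels):
    with tf.GradientTape() as g:
        # non-trainable variables are not watched by default
        for v in model.non_trainable_variables:
            g.watch(v)
        preds = model(data)
        loss = criterion(labels, preds)

    n_trainable = len(model.trainable_variables)
    # a single backward pass for all variables
    gradients = g.gradient(
        loss, model.trainable_variables + model.non_trainable_variables)
    trainable_gradients = gradients[:n_trainable]
    non_trainable_gradients = gradients[n_trainable:]

    # update only the trainable parameters
    optimizer.apply_gradients(zip(trainable_gradients, model.trainable_variables))
    return loss, non_trainable_gradients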
I am using a pretrained model like so:
base_model = keras.applications.Xception(
    weights='imagenet',
    input_shape=(150, 150, 3),
    include_top=False
)
Then I freeze all the layers:
base_model.trainable = False
Now I would like to unfreeze only, let's say, the lowest layers.
When I do base_model.summary(), the last layers of the network are listed at the bottom.
So, let's say I would like to unfreeze the block14_sepconv2 layer.
I do:
my_layer = base_model.get_layer('block14_sepconv2')
my_layer.trainable = True
And summary() still shows Trainable params: 0.
What am I doing wrong? How do I unfreeze only a few of the lowest layers?
What's interesting: when I first do base_model.trainable = True and then freeze layers starting from the top, the number of trainable params actually changes. But that is not intuitive to me, and above all not comprehensible.
Here is one way to unfreeze specific layers. We take the same model and pick some layers (e.g. block14_sepconv2). The goal is to unfreeze those layers and keep all the remaining layers frozen.
from tensorflow import keras
base_model = keras.applications.Xception(
    weights='imagenet',
    input_shape=(150, 150, 3),
    include_top=False
)

# freeze all layers except the desired ones,
# which are listed in [ ... ]
for layer in base_model.layers:
    if layer.name not in ['block14_sepconv2', 'block13_sepconv1']:
        layer.trainable = False
    if layer.trainable:
        print(layer.name)

block14_sepconv2
block13_sepconv1
Now compute the number of trainable and non-trainable parameters.
import tensorflow.keras.backend as K
import numpy as np

trainable_count = np.sum([K.count_params(w)
                          for w in base_model.trainable_weights])
non_trainable_count = np.sum([K.count_params(w)
                              for w in base_model.non_trainable_weights])

print('Total params: {:,}'.format(trainable_count + non_trainable_count))
print('Trainable params: {:,}'.format(trainable_count))
print('Non-trainable params: {:,}'.format(non_trainable_count))
Total params: 20,861,480
Trainable params: 3,696,088
Non-trainable params: 17,165,392
FYI, don't forget to recompile your model (model.compile(...)) each time you freeze or unfreeze the layers.
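As a sketch of that workflow (the classification head, optimizer and loss below are placeholders, not part of the original question; the point is only the recompile after unfreezing):

inputs = keras.Input(shape=(150, 150, 3))
x = base_model(inputs, training=False)   # keep batch-norm layers in inference mode
x = keras.layers.GlobalAveragePooling2D()(x)
outputs = keras.layers.Dense(10)(x)
model = keras.Model(inputs, outputs)

model.compile(optimizer=keras.optimizers.Adam(1e-4),
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# later: unfreeze one more layer, then compile again so the change takes effect
base_model.get_layer('block14_sepconv1').trainable = True
model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))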
I would like to create a layer (with tensorflow.keras) which contains both trainable and non-trainable weights. I tried doing so by subclassing keras.layers.Layer, as in this example:
class MySum(keras.layers.Layer):
    def __init__(self, units=32, **kwargs):
        super(MySum, self).__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        n_input = input_shape[-1]   # number of input elements
        n_output = self.units       # number of layer neurons
        n_input_div_2 = input_shape[-1] // 2

        # 1. add the trainable weights
        self.w = self.add_weight(shape=(n_input_div_2, self.units),
                                 initializer=tf.ones_initializer(),
                                 trainable=True)

        # 2. add the non-trainable weights
        self.w = self.add_weight(shape=(input_shape[-1] - n_input_div_2, self.units),
                                 initializer=tf.keras.initializers.Constant(value=3),
                                 trainable=False)

    def call(self, inputs):
        return tf.matmul(inputs, self.w)
Unfortunately, doing so makes all the weights non-trainable. If I add the non-trainable weights first, then all the weights become trainable (it seems that the trainable flag is set according to the last weights added).
What am I missing here?
EDIT:
I tried to use different names as suggested by Dr. Snoopy in the build function:
# 1. add the trainable weights
w1 = self.add_weight(shape=(n_input_div_2, self.units),
                     initializer=tf.ones_initializer(),
                     trainable=True)

# 2. add the non-trainable weights
w2 = self.add_weight(shape=(input_shape[-1] - n_input_div_2, self.units),
                     initializer=tf.keras.initializers.Constant(value=3),
                     trainable=False)

self.w = tf.concat([w1, w2], 0)
But, when I try to use my layer like this:
custom = customLayer.MySum(1, name='somme')
my_input = keras.Input(shape=(2,), name="input")
my_output = custom(my_input)
print(custom.get_weights())
I obtain via the get_weights() function:
tf.Tensor(
[[1.]
[3.]], shape=(2, 1), dtype=float32)
[array([[1.],
[1.]], dtype=float32), array([[1.]], dtype=float32), array([[3.]], dtype=float32)]
Where does the [[1.],[1.]] array come from? (I would like to have only the [[1.],[3.]] array.)
I also get lots of warnings when training my model: "WARNING:tensorflow:Gradients do not exist for variables ['somme/Variable:0', 'somme/Variable:0'] when minimizing the loss."
How does keras link my own weights (self.w) with the weights returned by get_weights()?
Note: when I create custom layers without mixing trainable and non-trainable weights, I don't have these issues.
As Dr. Snoopy pointed out, your first solution overwrites the previously defined weights by reusing the same attribute name.
As to why your second solution does not work either: after calling tf.concat on your two tf.Variables w1 and w2, the gradient disappears. It is a known bug in TensorFlow; you can find the issue on GitHub here: Gradients do not exist for variables after tf.concat(). #37726
A minimal reproducible example
Let's do a small experiment using tf.GradientTape to compute the gradient:
w1 = tf.Variable([1.0])
w2 = tf.Variable([3.0])
w = tf.expand_dims(tf.concat([w1, w2], 0), -1)
X = tf.random.normal((1, 2))
y = tf.reduce_sum(X, 1)

with tf.GradientTape(persistent=True) as tape:
    r = tf.matmul(X, w)
    loss = tf.metrics.mse(y, r)

print(tape.gradient(loss, w1))
Results in None.
A possible fix
One solution is to keep the variables separated. For your layer, with units=1, there is this trivial replacement of tf.matmul:
w1 = tf.Variable([1.0])
w2 = tf.Variable([3.0], trainable=False)
X = tf.random.normal((1, 2))
y = tf.reduce_sum(X, 1)

with tf.GradientTape(persistent=True) as tape:
    r = X[:, 0] * w1 + X[:, 1] * w2
    loss = tf.metrics.mse(y, r)

print(tape.gradient(loss, w1))
Outputs: tf.Tensor([-3.1425157], shape=(1,), dtype=float32)
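Extending that idea to the layer in the question (just a sketch, assuming the same half/half split as the original build): keep the two weight matrices as separate attributes and split the input instead of concatenating the weights. The result is the same as one big matmul, but the gradient path to the trainable half stays intact.

import tensorflow as tf
from tensorflow import keras

class MySum(keras.layers.Layer):
    def __init__(self, units=32, **kwargs):
        super(MySum, self).__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        self.n_half = input_shape[-1] // 2
        # trainable half of the weights
        self.w_train = self.add_weight(shape=(self.n_half, self.units),
                                       initializer=tf.ones_initializer(),
                                       trainable=True)
        # non-trainable half of the weights
        self.w_fixed = self.add_weight(shape=(input_shape[-1] - self.n_half, self.units),
                                       initializer=tf.keras.initializers.Constant(value=3),
                                       trainable=False)

    def call(self, inputs):
        # equivalent to matmul(inputs, concat([w_train, w_fixed], axis=0))
        return (tf.matmul(inputs[:, :self.n_half], self.w_train)
                + tf.matmul(inputs[:, self.n_half:], self.w_fixed))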
When we use tf.train.ExponentialMovingAverage.apply(var) to maintain moving averages of variables, and we then update a variable (e.g. with tf.assign), we have to use tf.train.ExponentialMovingAverage.average(var) to get the decayed (averaged) value; if we fetch the variable directly with tf.Session.run(var), we get its raw value without decay.
For example:
import tensorflow as tf

v1 = tf.Variable(0, dtype=tf.float32)
ema = tf.train.ExponentialMovingAverage(0.99)
maintain_average = ema.apply([v1])

with tf.Session() as sess:
    init = tf.initialize_all_variables()
    sess.run(init)

    print(sess.run([v1, ema.average(v1)]))
    # Out: [0.0, 0.0]

    sess.run(tf.assign(v1, 5))
    sess.run(maintain_average)
    sess.run(tf.assign(v1, 10))
    sess.run(maintain_average)
    print(sess.run([v1, ema.average(v1)]))
    # Out: [10.0, 0.14949986]
So when we train a neural network with ExponentialMovingAverage, does the model use the decayed variables (the values returned by tf.train.ExponentialMovingAverage.average()) by default?
A more concrete example:
image_tensor = tf.placeholder(tf.float32,
                              [BATCH_SIZE, IMAGE_SIZE, IMAGE_SIZE, IMAGE_CHANNELS],
                              'image-tensor')
label_tensor = tf.placeholder(tf.int32,
                              [None, 10],
                              'label-tensor')

net_output = creat_net(image_tensor)
# suppose creat_net() has built a neural network

global_step = tf.Variable(0, trainable=False)
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=net_output, labels=label_tensor))
loss = cross_entropy

train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)

ema = tf.train.ExponentialMovingAverage(MOVING_AVERAGE_DECAY, global_step)
with tf.control_dependencies([train_step]):
    training_op = ema.apply(tf.trainable_variables())
So when I run training_op to train the network, will the network use the averages by default, or do I need extra code to make it use the decayed variables? In other words, will GradientDescentOptimizer use the true values or the decayed values to compute the loss at the next step?
v1 is a variable with its own value (10.0 in your case).
tf.train.ExponentialMovingAverage maintains a shadow variable internally, which gets updated each time you run the update op returned by apply.
Every time you run that op after assigning a new value, you are just computing the next step of the exponential moving average (i.e. updating the private shadow variable of the tf.train.ExponentialMovingAverage op) without changing the input variables at all. So training keeps using the raw variables; the decayed values are only used where you explicitly read them with average().
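A minimal sketch of that behaviour (TF1 style; the toy variable and loss are made up): the optimizer keeps reading and updating the raw variable, and the shadow average only appears where you explicitly ask for it with average().

import tensorflow as tf

v = tf.Variable(0.0)
loss = tf.square(v - 3.0)
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
ema = tf.train.ExponentialMovingAverage(0.9)
with tf.control_dependencies([train_step]):
    training_op = ema.apply([v])   # update v first, then its shadow average

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(5):
        sess.run(training_op)
    raw, avg = sess.run([v, ema.average(v)])
    # raw is what the optimizer actually used and updated at every step;
    # avg lags behind and is never fed back into training automatically
    print(raw, avg)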
I am fine-tuning my model from a pretrained model using TF-Slim. When I use create_train_op, I see that it has a parameter called variables_to_train. Some tutorials use the flag as follows:
all_trainable = [v for v in tf.trainable_variables()]
trainable = [v for v in all_trainable]

train_op = slim.learning.create_train_op(
    opt,
    global_step=global_step,
    variables_to_train=trainable,
    summarize_gradients=True)
But the official TF-Slim examples do not use it:
all_trainable = [v for v in tf.trainable_variables()]
trainable = [v for v in all_trainable]

train_op = slim.learning.create_train_op(
    opt,
    global_step=global_step,
    summarize_gradients=True)
So, what is the difference between using and not using variables_to_train?
Your two examples both do the same thing: you train all trainable variables that occur in your graph. With the parameter variables_to_train you can define which variables should be updated during training.
A use case for this is when you have pre-trained components, such as word embeddings, that you don't want to train further in your model. With
train_vars = [v for v in tf.trainable_variables() if "embeddings" not in v.name]

train_op = slim.learning.create_train_op(
    opt,
    global_step=global_step,
    variables_to_train=train_vars,
    summarize_gradients=True)
you can exclude all variables from training that contain "embeddings" in their name. If you simply want to train all variables, you don't have to define train_vars and you can create the train op without the parameter variables_to_train.
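If you want to double-check what such a filter does, a quick sanity check could look like this (a sketch; "embeddings" is just the example substring used above):

excluded = [v.name for v in tf.trainable_variables() if "embeddings" in v.name]
print("kept frozen:", excluded)
print("updated by the train op:", [v.name for v in train_vars])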
I am confused about the difference between apply_gradients and minimize of an optimizer in TensorFlow. For example,
optimizer = tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(cnn.loss)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
and
optimizer = tf.train.AdamOptimizer(1e-3)
train_op = optimizer.minimize(cnn.loss, global_step=global_step)
Are they indeed the same?
If I want to decay the learning rate, can I use the following code?
global_step = tf.Variable(0, name="global_step", trainable=False)
starter_learning_rate = 1e-3
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
                                           100, FLAGS.decay_rate, staircase=True)

# Passing global_step to apply_gradients() will increment it at each step.
optimizer = tf.train.AdamOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(cnn.loss)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
Thanks for your help!
You can easily see from this link: https://www.tensorflow.org/get_started/get_started (the tf.train API part) that they actually do the same job.
The difference is that if you use the separated functions (optimizer.compute_gradients and optimizer.apply_gradients), you can apply other mechanisms between them, such as gradient clipping.
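For example, gradient clipping can be slotted in between the two calls (a sketch reusing optimizer, cnn.loss and global_step from the question; the clip norm of 5.0 is arbitrary):

grads_and_vars = optimizer.compute_gradients(cnn.loss)
# clip each gradient before applying it
clipped = [(tf.clip_by_norm(g, 5.0), v)
           for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped, global_step=global_step)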
Here the documentation says that minimize uses tf.GradientTape and then apply_gradients:
Minimize loss by updating var_list.
This method simply computes gradient using tf.GradientTape and calls
apply_gradients(). If you want to process the gradient before applying
then call tf.GradientTape and apply_gradients() explicitly instead of
using this function.
So minimize actually uses apply_gradients just like:
def minimize(self, loss, var_list, grad_loss=None, name=None, tape=None):
    grads_and_vars = self._compute_gradients(
        loss, var_list=var_list, grad_loss=grad_loss, tape=tape)
    return self.apply_gradients(grads_and_vars, name=name)
In your example, you use compute_gradients and apply_gradients; this is indeed valid, but nowadays compute_gradients has been made private (note the _compute_gradients above), so it is no longer good practice to use it. For this reason, the function no longer appears in the documentation.
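In TF2 the recommended equivalent, and the place to hook in any gradient processing, is to compute the gradients yourself with tf.GradientTape and then call apply_gradients (a sketch; model, loss_fn, x_batch and y_batch are placeholders):

optimizer = tf.keras.optimizers.Adam(1e-3)

with tf.GradientTape() as tape:
    preds = model(x_batch, training=True)
    loss = loss_fn(y_batch, preds)
grads = tape.gradient(loss, model.trainable_variables)
# process the gradients here if needed (e.g. clipping), then apply them
optimizer.apply_gradients(zip(grads, model.trainable_variables))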