Tensorflow: Why is my loss declining although my gradients are zero?

To debug my code and understand RNNs, I set my gradients manually to zero like this:
gvs = optimizer.compute_gradients(cost)
gvs[0] = (tf.zeros((5002,2), dtype=tf.float32), tf.trainable_variables()[0])
gvs[1] = (tf.zeros((2,), dtype=tf.float32), tf.trainable_variables()[1])
train_op = optimizer.apply_gradients(gvs)
I only have two trainable variables, so the quick-and-dirty approach above should set all gradients to zero:
tf.trainable_variables()
Out[8]:
[<tf.Variable 'rnn/basic_rnn_cell/kernel:0' shape=(5002, 2) dtype=float32_ref>,
<tf.Variable 'rnn/basic_rnn_cell/bias:0' shape=(2,) dtype=float32_ref>]
When I run the network, the loss still declines. How can that be? As far as I understand, the new variable value should be the old value minus the learning rate times the gradient.
I am using the AdaGradOptimizer.
Update: np.sum(sess.run(gvs[0][0])) and np.sum(sess.run(gvs[1][0])) both return 0.
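One way to narrow this down is to compare a variable's value before and after one run of train_op: with zeroed gradients nothing should change. A minimal sketch, assuming sess is your session and train_feed is the feed_dict used in your training loop (both names are placeholders for whatever you already have):
import numpy as np

w = tf.trainable_variables()[0]                  # the RNN kernel variable
w_before = sess.run(w)
sess.run(train_op, feed_dict=train_feed)         # one training step with the zeroed gradients
w_after = sess.run(w)
print("max abs change:", np.max(np.abs(w_after - w_before)))  # 0.0 means the variables really are frozen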

Related

mixed_precision - not learning anything

I am trying to use mixed_precision for my neural network. Currently I am training on only one image with no augmentation. The network learns when I am not using
policy = tf.keras.mixed_precision.Policy('mixed_float16')
tf.keras.mixed_precision.set_global_policy(policy)
(see the "Without Mixed Precision" plot), but with the policy enabled the loss stays constant (see the "With Mixed Precision" plot).
Creating Optimizer:
optimizer = tf.keras.optimizers.RMSprop()
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(optimizer)
Training step:
with tf.GradientTape() as tape:
    heat_pred = self.model(image_batch, training=True)
    loss = self.getTotalLoss(heat_pred, annotation_batch)
    print("Training loss: {}".format(loss))
    scaled_loss = optimizer.get_scaled_loss(loss)
    print("Training scaled_loss: {}".format(scaled_loss))
scaled_gradients = tape.gradient(scaled_loss, self.model.trainable_variables)
gradients = optimizer.get_unscaled_gradients(scaled_gradients)
optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
Has anyone seen something like this?
EDIT:
If I check the gradients they are not 0. They look something like:
[-2.203200e+04, 4.542500e+02, 1.624000e+03, ...,
6.125000e+02, 6.860000e+02, 7.970000e+02],
[-1.819000e+03, 2.240625e+01, 1.200625e+02, ...,
4.284375e+01, 6.003125e+01, 7.137500e+01],
[-1.928000e+04, 3.230000e+02, 1.611000e+03, ...,
4.502500e+02, 8.300000e+02, 7.055000e+02]]]], dtype=float32)>, <tf.Tensor: shape=(18,), dtype=float32, numpy=
array([-40192., 981., 3306., 1214., 2340., 2396., 1392.,
2808., 2060., 3936., 2304., 4408., 2352., 3656.,
2282., 1054., 1890., 1842.], dtype=float32)>]
But when optimizer.apply_gradients(zip(gradients, self.model.trainable_variables)) is called, the weights don't seem to be updated (the loss stays constant).
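One thing worth checking with a dynamic LossScaleOptimizer is whether the step is being skipped because the scaled gradients are non-finite; in that case it halves the loss scale and applies no update. A minimal sketch, assuming TF 2.4+ eager execution and the names from the snippet above:
import numpy as np

w = self.model.trainable_variables[0]
w_before = w.numpy()
print("loss scale before step:", float(optimizer.loss_scale))
optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
print("loss scale after step:", float(optimizer.loss_scale))    # a repeatedly shrinking scale means steps are being skipped
print("max abs weight change:", np.max(np.abs(w.numpy() - w_before)))  # 0.0 means the update really was a no-op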

Why is K.gradients returning None for the gradient of the loss wrt the input

I am wondering why I am getting none for my grads in the following code:
import tensorflow.keras.losses as losses
loss = losses.squared_hinge(y_true, y_pred)
from tensorflow.keras import backend as K
grads = K.gradients(loss, CNN_model.input)[0]
iterate = K.function([CNN_model.input], [loss, grads])
my CNN_model.input is:
<tf.Tensor 'conv2d_3_input:0' shape=(?, 28, 28, 1) dtype=float32>
my loss is:
<tf.Tensor 'Mean_3:0' shape=(1,) dtype=float64>
Note: I am passing the predicted output of an SVM as y_pred for my application if that is of importance.
As far as I understand from my previous experience, TensorFlow needs a GradientTape to record the operations applied to a variable in order to compute its gradients. In your case it should look something like this:
import numpy as np
import tensorflow as tf

x = np.random.rand(10)   # your input variable
x = tf.Variable(x)       # to be evaluated by GradientTape the input should be a tensor
with tf.GradientTape() as tape:
    tape.watch(x)                              # with this method you can observe your variable
    proba = model(x)                           # get the prediction for the input
    loss = your_loss_function(y_true, proba)   # compute the loss
gradient = tape.gradient(loss, x)              # compute the gradients; this must be done outside the recording
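Adapted to the CNN in the question, the same pattern would look roughly like this (a hedged sketch: the dummy input shape matches conv2d_3_input, and the 10-output hinge-style y_true is purely illustrative):
import numpy as np
import tensorflow as tf

x = tf.Variable(np.random.rand(1, 28, 28, 1).astype("float32"))   # dummy image batch
y_true = tf.constant([[-1., -1., 1., -1., -1., -1., -1., -1., -1., -1.]])  # assumes 10 outputs
with tf.GradientTape() as tape:
    y_pred = CNN_model(x)                                          # x is a Variable, so the tape watches it automatically
    loss = tf.reduce_mean(tf.keras.losses.squared_hinge(y_true, y_pred))
grads = tape.gradient(loss, x)                                     # gradient of the loss w.r.t. the input image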

How to change the LSTMCell weight format from tensorflow to tf.keras

I have some old code from TensorFlow that I want to make work with TensorFlow 2 / tf.keras. I would like to keep the same LSTM weights, but cannot figure out how to convert the format.
I have the old weights saved in a checkpoint file, and also have them saved in csv files.
My old code looks something like this:
input_placeholder = tf.placeholder(tf.float32, [None, None, input_units])
lstm_layers = [tf.nn.rnn_cell.LSTMCell(layer_size), tf.nn.rnn_cell.LSTMCell(layer_size)]
stacked = tf.contrib.rnn.MultiRNNCell(lstm_layers)
features, state = tf.nn.dynamic_rnn(stacked, input_placeholder, dtype=tf.float32)
And my new code looks something like this:
input_placeholder = tf.placeholder(tf.float32, [None, None, input_units])
lstm_layers = [tf.keras.layers.LSTMCell(layer_size),tf.keras.layers.LSTMCell(layer_size)]
stacked = tf.keras.layers.StackedRNNCells(lstm_layers)
features = stacked(input_placeholder)
... #later in the code
features.set_weights(previous_weights)
The old bias seems to match the new bias.
The old kernel seems to be the concatenation of the kernel and recurrent kernel.
I am able to load the previous_weights into the model (have explicitly checked the weights loaded correctly), however tests I have fail to produce the same result.
Digging into the source code, the kernels seem to have a different format under the hood.
Is it possible to calculate the kernel and recurrent_kernel (tf.keras) using these old saved kernel weights?
Links if they're helpful:
https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/python/ops/rnn_cell_impl.py
https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/python/keras/layers/recurrent.py
In case anyone else encounters this.
There are three differences that I found for migrating weights:
The kernel's gate blocks are reordered. Both implementations perform the four dot-products that the LSTM calls for with one (or two) matrix multiplications by concatenating the per-gate weights along the 4*units axis (axis=1 of the kernel). The catch is that the middle two quarters of this concatenated weight matrix are swapped between the two implementations.
The kernel is divided along axis=0. The rnn_cell implementation has a single weight matrix that is dot-producted with a concatenation of the inputs and the hidden state, whereas the keras implementation stores these as two attributes, kernel and recurrent_kernel, and dot-products them separately before summing.
A forget bias is explicitly added in the cell calculation in rnn_cell, but is folded into the cell bias in keras, where the corresponding option only modifies the initialisation.
A migration function that accounts for these three differences is
import tensorflow as tf

def convert_lstm_weights(tf1_kernel, tf1_bias, forget_bias=True):
    # TF1 concatenates the gates as [i, j, f, o]; keras expects [i, f, j, o], so swap the middle two blocks
    a, b, c, d = tf.split(tf1_kernel, num_or_size_splits=4, axis=1)
    lstm_kernel = tf.concat(values=[a, c, b, d], axis=1)
    # Split the rows into input weights and recurrent weights
    # (hps.hidden_dim is the LSTM hidden size from the author's hyperparameter object)
    kernel, recurrent_kernel = lstm_kernel[:-hps.hidden_dim], lstm_kernel[-hps.hidden_dim:]
    a, b, c, d = tf.split(tf1_bias, num_or_size_splits=4, axis=0)
    # forget_bias=True adds 1.0 to the forget-gate bias, matching TF1's default forget_bias=1.0
    bias = tf.concat(values=[a, c + float(forget_bias), b, d], axis=0)
    return kernel, recurrent_kernel, bias
And two differences I've found that need to be accounted for during use:
The gate activation in tf.compat.v1.nn.rnn_cell.LSTMCell is sigmoid, but tf.keras.layers.LSTMCell defaults to hard sigmoid, so this needs to be set on initialization with recurrent_activation="sigmoid".
The states are returned in opposite orders.
output, (c_state_new, m_state_new) = tf.compat.v1.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple=True)(input, (c_state, m_state))
becomes
output, (h_state_new, c_state_new) = tf.keras.layers.LSTMCell(hidden_size, recurrent_activation="sigmoid")(input, (h_state, c_state))
where the hidden state is referred to by m in rnn_cell and h in keras.
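For anyone following along, a minimal usage sketch of the conversion, assuming TF2 eager execution; tf1_kernel and tf1_bias stand for the arrays restored from the old checkpoint or CSV files, and hidden_size and input_dim are your layer sizes (hps.hidden_dim in the function above must match hidden_size):
import numpy as np
import tensorflow as tf

kernel, recurrent_kernel, bias = convert_lstm_weights(tf1_kernel, tf1_bias, forget_bias=True)

cell = tf.keras.layers.LSTMCell(hidden_size, recurrent_activation="sigmoid")
cell.build((None, input_dim))                      # create the cell's variables before assigning
cell.set_weights([np.asarray(kernel),              # convert_lstm_weights returns tensors,
                  np.asarray(recurrent_kernel),    # set_weights expects numpy arrays
                  np.asarray(bias)])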
You can split the matrix:
If you look at the TF1 source here, the kernel matrix of a TF1 LSTMCell has shape (input_depth + num_units, 4 * num_units).
Let's say you have 20 inputs and 128 nodes in an LSTM layer
input_units=20
layer_size = 128
input_placeholder = tf.placeholder(tf.float32, [None, None, input_units])
lstm_layers = [tf.nn.rnn_cell.LSTMCell(layer_size), tf.nn.rnn_cell.LSTMCell(layer_size)]
stacked = tf.contrib.rnn.MultiRNNCell(lstm_layers)
output, state = tf.nn.dynamic_rnn(stacked, input_placeholder, dtype=tf.float32)
Your trainable parameters will have these shapes:
[<tf.Variable 'rnn/multi_rnn_cell/cell_0/lstm_cell/kernel:0' shape=(148, 512) dtype=float32_ref>,
<tf.Variable 'rnn/multi_rnn_cell/cell_0/lstm_cell/bias:0' shape=(512,) dtype=float32_ref>,
<tf.Variable 'rnn/multi_rnn_cell/cell_1/lstm_cell/kernel:0' shape=(256, 512) dtype=float32_ref>,
<tf.Variable 'rnn/multi_rnn_cell/cell_1/lstm_cell/bias:0' shape=(512,) dtype=float32_ref>]
In TF 1.0, the kernel and the recurrent kernel of TF 2.0 are concatenated into a single matrix (see here):
def build(self, input_shape):
    self.kernel = self.add_weight(shape=(input_shape[-1], self.units),
                                  initializer='uniform',
                                  name='kernel')
    self.recurrent_kernel = self.add_weight(
        shape=(self.units, self.units),
        initializer='uniform',
        name='recurrent_kernel')
    self.built = True
In this new version you now have two separate weight matrices.
input_placeholder = tf.placeholder(tf.float32, [None, None, input_units])
lstm_layers = [tf.keras.layers.LSTMCell(layer_size),tf.keras.layers.LSTMCell(layer_size)]
stacked = tf.keras.layers.StackedRNNCells(lstm_layers)
output = tf.keras.layers.RNN(stacked, return_sequences=True, return_state=True, dtype=tf.float32)(input_placeholder)
Thus, your trainable parameters are:
[<tf.Variable 'rnn_1/while/stacked_rnn_cells_1/kernel:0' shape=(20, 512) dtype=float32>,
<tf.Variable 'rnn_1/while/stacked_rnn_cells_1/recurrent_kernel:0' shape=(128, 512) dtype=float32>,
<tf.Variable 'rnn_1/while/stacked_rnn_cells_1/bias:0' shape=(512,) dtype=float32>,
<tf.Variable 'rnn_1/while/stacked_rnn_cells_1/kernel_1:0' shape=(128, 512) dtype=float32>,
<tf.Variable 'rnn_1/while/stacked_rnn_cells_1/recurrent_kernel_1:0' shape=(128, 512) dtype=float32>,
<tf.Variable 'rnn_1/while/stacked_rnn_cells_1/bias_1:0' shape=(512,) dtype=float32>]
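In other words, to reuse the old weights you can split each TF1 kernel row-wise into the TF2 kernel and recurrent_kernel. A minimal sketch for the first cell (old_kernel is a hypothetical name for the (148, 512) array restored from the checkpoint; the gate reordering described in the other answer still applies on top of this):
input_units = 20
layer_size = 128

# old_kernel has shape (input_units + layer_size, 4 * layer_size) = (148, 512)
kernel_tf2 = old_kernel[:input_units, :]             # (20, 512): multiplies the inputs
recurrent_kernel_tf2 = old_kernel[input_units:, :]   # (128, 512): multiplies the hidden state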

Tensorflow - No gradients provided for any variable

I am experimenting with some code in Jupyter and keep getting stuck here. Things actually work fine if I remove the line starting with "optimizer = ..." and all references to that line, but if I keep it in the code, I get an error.
I am not pasting all the other functions here to keep the code at a readable length. I hope someone more experienced can see at once what the problem is.
Note that there are 5, 4, 3, and 2 units in the input layer, the two hidden layers, and the output layer, respectively.
CODE:
tf.reset_default_graph()
num_units_in_layers = [5,4,3,2]
X = tf.placeholder(shape=[5, 3], dtype=tf.float32)
Y = tf.placeholder(shape=[2, 3], dtype=tf.float32)
parameters = initialize_layer_parameters(num_units_in_layers)
init = tf.global_variables_initializer()
my_sess = tf.Session()
my_sess.run(init)
ZL = forward_propagation_with_relu(X, num_units_in_layers, parameters, my_sess)
#my_sess.run(parameters) # Do I need to run this? Or is it obsolete?
cost = compute_cost(ZL, Y, my_sess, parameters, batch_size=3, lambd=0.05)
optimizer = tf.train.AdamOptimizer(learning_rate = 0.001).minimize(cost)
_ , minibatch_cost = my_sess.run([optimizer, cost],
                                 feed_dict={X: minibatch_X,
                                            Y: minibatch_Y})
print(minibatch_cost)
my_sess.close()
ERROR:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-321-135b9fc18268> in <module>()
16 cost = compute_cost(ZL, Y, my_sess, parameters, 3, 0.05)
17
---> 18 optimizer = tf.train.AdamOptimizer(learning_rate = 0.001).minimize(cost)
19 _ , minibatch_cost = my_sess.run([optimizer, cost],
20 feed_dict={X: minibatch_X,
~/.local/lib/python3.5/site-packages/tensorflow/python/training/optimizer.py in minimize(self, loss, global_step, var_list, gate_gradients, aggregation_method, colocate_gradients_with_ops, name, grad_loss)
362 "No gradients provided for any variable, check your graph for ops"
363 " that do not support gradients, between variables %s and loss %s." %
--> 364 ([str(v) for _, v in grads_and_vars], loss))
365
366 return self.apply_gradients(grads_and_vars, global_step=global_step,
ValueError: No gradients provided for any variable, check your graph for ops that do not support gradients, between variables ["<tf.Variable 'weights/W1:0' shape=(4, 5) dtype=float32_ref>", "<tf.Variable 'biases/b1:0' shape=(4, 1) dtype=float32_ref>", "<tf.Variable 'weights/W2:0' shape=(3, 4) dtype=float32_ref>", "<tf.Variable 'biases/b2:0' shape=(3, 1) dtype=float32_ref>", "<tf.Variable 'weights/W3:0' shape=(2, 3) dtype=float32_ref>", "<tf.Variable 'biases/b3:0' shape=(2, 1) dtype=float32_ref>"] and loss Tensor("Add_3:0", shape=(), dtype=float32).
Note that if I run
print(tf.trainable_variables())
just before the "optimizer = ..." line, I actually see my trainable variables there.
[<tf.Variable 'weights/W1:0' shape=(4, 5) dtype=float32_ref>, <tf.Variable 'biases/b1:0' shape=(4, 1) dtype=float32_ref>, <tf.Variable 'weights/W2:0' shape=(3, 4) dtype=float32_ref>, <tf.Variable 'biases/b2:0' shape=(3, 1) dtype=float32_ref>, <tf.Variable 'weights/W3:0' shape=(2, 3) dtype=float32_ref>, <tf.Variable 'biases/b3:0' shape=(2, 1) dtype=float32_ref>]
Would anyone have an idea about what can be the problem?
EDITING and ADDING SOME MORE INFO:
In case you would like to see how I create and initialize my parameters, here is the code. Maybe there is something wrong with this part, but I don't see what.
def get_nn_parameter(variable_scope, variable_name, dim1, dim2):
    with tf.variable_scope(variable_scope, reuse=tf.AUTO_REUSE):
        v = tf.get_variable(variable_name,
                            [dim1, dim2],
                            trainable=True,
                            initializer=tf.contrib.layers.xavier_initializer())
    return v

def initialize_layer_parameters(num_units_in_layers):
    parameters = {}
    L = len(num_units_in_layers)
    for i in range(1, L):
        temp_weight = get_nn_parameter("weights",
                                       "W" + str(i),
                                       num_units_in_layers[i],
                                       num_units_in_layers[i-1])
        parameters.update({"W" + str(i): temp_weight})
        temp_bias = get_nn_parameter("biases",
                                     "b" + str(i),
                                     num_units_in_layers[i],
                                     1)
        parameters.update({"b" + str(i): temp_bias})
    return parameters
ADDENDUM
I got it working. Instead of writing a separate answer, I am adding the correct version of my code here.
(David's answer below helped a lot.)
I simply removed my_sess as a parameter to my compute_cost function. (I could not make it work with it before, but apparently it is not needed at all.) I also reordered the statements in my main function to call things in the right order.
Here is the working version of my cost function and how I call it:
def compute_cost(ZL, Y, parameters, mb_size, lambd):
    logits = tf.transpose(ZL)
    labels = tf.transpose(Y)
    cost_unregularized = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=labels))
    # Since the dict parameters includes both W and b, it needs to be divided by 2 to find L
    L = len(parameters) // 2
    list_sum_weights = []
    for i in range(0, L):
        list_sum_weights.append(tf.nn.l2_loss(parameters.get("W" + str(i+1))))
    regularization_effect = tf.multiply((lambd / mb_size), tf.add_n(list_sum_weights))
    cost = tf.add(cost_unregularized, regularization_effect)
    return cost
And here is the main function where I call the compute_cost(..) function:
tf.reset_default_graph()
num_units_in_layers = [5,4,3,2]
X = tf.placeholder(shape=[5, 3], dtype=tf.float32)
Y = tf.placeholder(shape=[2, 3], dtype=tf.float32)
parameters = initialize_layer_parameters(num_units_in_layers)
my_sess = tf.Session()
ZL = forward_propagation_with_relu(X, num_units_in_layers, parameters)
cost = compute_cost(ZL, Y, parameters, 3, 0.05)
optimizer = tf.train.AdamOptimizer(learning_rate = 0.001).minimize(cost)
init = tf.global_variables_initializer()
my_sess.run(init)
_ , minibatch_cost = my_sess.run([optimizer, cost],
                                 feed_dict={X: [[-1.,4.,-7.],[2.,6.,2.],[3.,3.,9.],[8.,4.,4.],[5.,3.,5.]],
                                            Y: [[0.6, 0., 0.3], [0.4, 0., 0.7]]})
print(minibatch_cost)
my_sess.close()
I'm 99.9% sure you're creating your cost function incorrectly.
cost = compute_cost(ZL, Y, my_sess, parameters, batch_size=3, lambd=0.05)
Your cost function should be a tensor. You are passing your session into the cost function, which looks like it is actually trying to run the TensorFlow session inside it, and that is a serious error.
Then later you're passing the result of compute_cost to your minimizer.
This is a common misunderstanding about tensorflow.
TensorFlow follows a declarative programming paradigm: you first declare all the operations you want to run, and only then do you pass data in and run them.
Refactor your code to strictly follow this best practice:
(1) Create a build_graph() function; all of your math operations should be placed in this function. You should define your cost function and all the layers of the network there. Return the optimizer.minimize() training op (and any other ops you might want to get back, such as accuracy).
(2) Now create a session.
(3) After this point do not create any more tensorflow operations or variables, if you feel like you need to you're doing something wrong.
(4) Call sess.run on your train_op, and pass in the placeholder data via feed_dict.
Here's a simple example of how to structure your code:
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/3_NeuralNetworks/neural_network_raw.ipynb
In general there are tremendously good examples put up by aymericdamien, I strongly recommend reviewing them to learn the basics of tensorflow.
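A minimal skeleton of that structure, with illustrative shapes matching the question (this is a hedged sketch, not the asker's actual layers or functions):
import numpy as np
import tensorflow as tf

def build_graph():
    # (1) Declare every op: placeholders, parameters, cost, and the training op.
    X = tf.placeholder(tf.float32, shape=[5, 3])
    Y = tf.placeholder(tf.float32, shape=[2, 3])
    W = tf.get_variable("W", [2, 5], initializer=tf.glorot_uniform_initializer())
    b = tf.get_variable("b", [2, 1], initializer=tf.zeros_initializer())
    ZL = tf.matmul(W, X) + b
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(
        logits=tf.transpose(ZL), labels=tf.transpose(Y)))
    train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)
    return X, Y, train_op, cost

X, Y, train_op, cost = build_graph()

# (2) Create the session; (3) define no further ops after this point.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # (4) Run the training op, feeding the placeholder data.
    _, c = sess.run([train_op, cost],
                    feed_dict={X: np.random.randn(5, 3),
                               Y: np.random.rand(2, 3)})
    print(c)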

Tensorflow: Using Batch Normalization gives poor (erratic) validation loss and accuracy

I am trying to use Batch Normalization using tf.layers.batch_normalization() and my code looks like this:
def create_conv_exp_model(fingerprint_input, model_settings, is_training):
    # Dropout placeholder
    if is_training:
        dropout_prob = tf.placeholder(tf.float32, name='dropout_prob')

    # Mode placeholder
    mode_placeholder = tf.placeholder(tf.bool, name="mode_placeholder")

    he_init = tf.contrib.layers.variance_scaling_initializer(mode="FAN_AVG")

    # Input Layer
    input_frequency_size = model_settings['bins']
    input_time_size = model_settings['spectrogram_length']
    net = tf.reshape(fingerprint_input,
                     [-1, input_time_size, input_frequency_size, 1],
                     name="reshape")
    net = tf.layers.batch_normalization(net,
                                        training=mode_placeholder,
                                        name='bn_0')

    for i in range(1, 6):
        net = tf.layers.conv2d(inputs=net,
                               filters=8*(2**i),
                               kernel_size=[5, 5],
                               padding='same',
                               kernel_initializer=he_init,
                               name="conv_%d" % i)
        net = tf.layers.batch_normalization(net,
                                            training=mode_placeholder,
                                            name='bn_%d' % i)
        with tf.name_scope("relu_%d" % i):
            net = tf.nn.relu(net)
        net = tf.layers.max_pooling2d(net, [2, 2], [2, 2], 'SAME',
                                      name="maxpool_%d" % i)

    net_shape = net.get_shape().as_list()
    net_height = net_shape[1]
    net_width = net_shape[2]
    net = tf.layers.conv2d(inputs=net,
                           filters=1024,
                           kernel_size=[net_height, net_width],
                           strides=(net_height, net_width),
                           padding='same',
                           kernel_initializer=he_init,
                           name="conv_f")
    net = tf.layers.batch_normalization(net,
                                        training=mode_placeholder,
                                        name='bn_f')
    with tf.name_scope("relu_f"):
        net = tf.nn.relu(net)

    net = tf.layers.conv2d(inputs=net,
                           filters=model_settings['label_count'],
                           kernel_size=[1, 1],
                           padding='same',
                           kernel_initializer=he_init,
                           name="conv_l")

    ### Squeeze
    squeezed = tf.squeeze(net, axis=[1, 2], name="squeezed")

    if is_training:
        return squeezed, dropout_prob, mode_placeholder
    else:
        return squeezed, mode_placeholder
And my train step looks like this:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_input)
    gvs = optimizer.compute_gradients(cross_entropy_mean)
    capped_gvs = [(tf.clip_by_value(grad, -2., 2.), var) for grad, var in gvs]
    train_step = optimizer.apply_gradients(capped_gvs)
During training, I am feeding the graph with:
train_summary, train_accuracy, cross_entropy_value, _, _ = sess.run(
    [
        merged_summaries, evaluation_step, cross_entropy_mean, train_step,
        increment_global_step
    ],
    feed_dict={
        fingerprint_input: train_fingerprints,
        ground_truth_input: train_ground_truth,
        learning_rate_input: learning_rate_value,
        dropout_prob: 0.5,
        mode_placeholder: True
    })
During validation,
validation_summary, validation_accuracy, conf_matrix = sess.run(
    [merged_summaries, evaluation_step, confusion_matrix],
    feed_dict={
        fingerprint_input: validation_fingerprints,
        ground_truth_input: validation_ground_truth,
        dropout_prob: 1.0,
        mode_placeholder: False
    })
My loss and accuracy curves (orange is training, blue is validation): plots of loss vs. number of iterations and accuracy vs. number of iterations.
The validation loss (and accuracy) seem very erratic. Is my implementation of Batch Normalization wrong? Or is this normal with Batch Normalization and I should wait for more iterations?
You need to pass is_training to tf.layers.batch_normalization(..., training=is_training) or it tries to normalize the inference minibatches using the minibatch statistics instead of the training statistics, which is wrong.
There are mainly two things to check.
1. Are you sure that you are using batch normalization (BN) correctly in the train op?
If you read the layer documentation:
Note: when training, the moving_mean and moving_variance need to be updated.
By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they
need to be added as a dependency to the train_op. Also, be sure to add
any batch_normalization ops before getting the update_ops collection.
Otherwise, update_ops will be empty, and training/inference will not work
properly.
For example:
x_norm = tf.layers.batch_normalization(x, training=training)
# ...
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)
2. Otherwise, try lowering the "momentum" in the BN.
During training, in fact, BN keeps two moving averages, of the mean and of the variance, which are supposed to approximate the population statistics. The moving mean and variance are initialized to 0 and 1 respectively; then, at each step, the running value is multiplied by the momentum (default 0.99) and the new batch statistic is added with weight 1 - momentum (i.e. 0.01). At inference (test) time the normalization uses these statistics. For this reason it takes a while for these values to converge to the "real" mean and variance of the data.
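For example, a lower momentum can be passed directly to the layer. A minimal sketch using the bn_0 layer from the question (0.9 is just an illustrative value; the default is 0.99):
net = tf.layers.batch_normalization(net,
                                    training=mode_placeholder,
                                    momentum=0.9,   # lower than the default so the moving statistics adapt faster
                                    name='bn_0')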
Source:
https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization
https://github.com/keras-team/keras/issues/7265
https://github.com/keras-team/keras/issues/3366
The original BN paper can be found here:
https://arxiv.org/abs/1502.03167
I also observed oscillations in validation loss when adding batch norm before ReLU. We found that moving the batch norm after the ReLU resolved the issue.
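Applied to the loop in the question, that reordering would look roughly like this (a sketch only; it simply swaps the bn_%d and relu_%d calls):
net = tf.layers.conv2d(inputs=net,
                       filters=8*(2**i),
                       kernel_size=[5, 5],
                       padding='same',
                       kernel_initializer=he_init,
                       name="conv_%d" % i)
with tf.name_scope("relu_%d" % i):
    net = tf.nn.relu(net)
net = tf.layers.batch_normalization(net,
                                    training=mode_placeholder,
                                    name='bn_%d' % i)
net = tf.layers.max_pooling2d(net, [2, 2], [2, 2], 'SAME',
                              name="maxpool_%d" % i)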