Why is the gradient of categorical crossentropy loss with respect to logits 0 with gradient tape in TF2.0?

I am learning TensorFlow 2.0 and trying to figure out how gradient tapes work. In this simple example I evaluate the cross entropy loss between logits and labels, and I am wondering why the gradient with respect to the logits is zero (please look at the code below).
The version of TF is tensorflow-gpu==2.0.0-rc0.
logits = tf.Variable([[1, 0, 0], [1, 0, 0], [1, 0, 0]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=tf.float32)
with tf.GradientTape(persistent=True) as tape:
    loss = tf.reduce_sum(tf.losses.categorical_crossentropy(labels, logits))
grads = tape.gradient(loss, logits)
I am getting
tf.Tensor(
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]], shape=(3, 3), dtype=float32)
as a result, but shouldn't it tell me how much I should change the logits in order to minimize the loss?

When calculating the cross entropy loss, set from_logits=True in tf.losses.categorical_crossentropy(). By default it is False, which means the loss is calculated directly as -p*log(q), treating q as probabilities. By setting from_logits=True, you calculate the loss as -p*log(softmax(q)).
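To make the difference concrete, here is a minimal pure-Python sketch of both formulas for one row of the question (logits [1, 0, 0], label [0, 1, 0]). The helper names and the clipping constant EPSILON = 1e-7 are assumptions made for illustration, not taken from the thread. If the library clips probabilities in this way, inputs of exactly 0 and 1 sit outside the clipping range, which would be one plausible explanation for the all-zero gradient; treat that as a hypothesis rather than a confirmed fact.

```python
import math

EPSILON = 1e-7  # assumption: clipping constant applied when from_logits=False


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = [z - max(logits) for z in logits]
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def loss_from_probs(labels, probs):
    """-sum(p * log(q)) after renormalizing and clipping q (from_logits=False)."""
    total = sum(probs)
    clipped = [min(max(q / total, EPSILON), 1.0 - EPSILON) for q in probs]
    return -sum(p * math.log(q) for p, q in zip(labels, clipped))


def loss_from_logits(labels, logits):
    """-sum(p * log(softmax(q))) (from_logits=True)."""
    return -sum(p * math.log(q) for p, q in zip(labels, softmax(logits)))


labels = [0.0, 1.0, 0.0]
logits = [1.0, 0.0, 0.0]

loss_as_probs = loss_from_probs(labels, logits)
loss_as_logits = loss_from_logits(labels, logits)
# For softmax cross entropy, the gradient w.r.t. the logits is softmax(q) - p.
grad_as_logits = [q - p for q, p in zip(softmax(logits), labels)]

print(loss_as_probs, loss_as_logits)
print(grad_as_logits)
```

With softmax in the loop the gradient is nonzero for every entry, so an optimizer has a direction to move in.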
I just found one interesting result.
logits = tf.Variable([[0.8, 0.1, 0.1]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0]],dtype=tf.float32)
with tf.GradientTape(persistent=True) as tape:
    loss = tf.reduce_sum(tf.keras.losses.categorical_crossentropy(labels, logits, from_logits=False))
grads = tape.gradient(loss, logits)
The grads will be tf.Tensor([[-0.25 1. 1. ]], shape=(1, 3), dtype=float32)
Previously, I thought TensorFlow would use loss = -\Sigma_i p_i \log(q_i) to calculate the loss, whose derivative with respect to q_i is -p_i/q_i. So the expected grads would be [-1.25, 0, 0]. But the output grads all look like they are increased by 1. That won't affect the optimization process, though.
For now, I'm still trying to figure out why the grads are increased by one. After reading the source code of tf.categorical_crossentropy, I found that even when we set from_logits=False, it still normalizes the probabilities. That changes the final gradient expression. Specifically, the gradient becomes -p_i/q_i + p_i/sum_j(q_j). If p_i=1 and sum_j(q_j)=1, the final gradient is increased by one. That's why the first gradient is -0.25; however, I haven't figured out why the last two gradients are 1.
To show that all gradients are increased by 1/sum_j(q_j):
logits = tf.Variable([[0.5, 0.1, 0.1]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0]],dtype=tf.float32)
with tf.GradientTape(persistent=True) as tape:
    loss = tf.reduce_sum(tf.keras.losses.categorical_crossentropy(labels, logits, from_logits=False))
grads = tape.gradient(loss, logits)
The grads are tf.Tensor([[-0.57142866 1.4285713 1.4285713 ]]), whereas -p_i/q_i alone would give [-2, 0, 0].
This shows that all gradients are increased by 1/(0.5+0.1+0.1). For p_i==1, the increase of 1/(0.5+0.1+0.1) makes sense to me. But I don't understand why the gradient is still increased by 1/(0.5+0.1+0.1) when p_i==0.

I finally figured it out.
The Keras categorical cross entropy calculates the gradient in the following way:
sum(target) / sum(input) - target / input
You sum the values of both targets and inputs ONLY where input(i) is different from zero.
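The formula above can be checked numerically. The sketch below assumes the loss is -sum_i p_i * log(q_i / sum_j q_j), ignores any clipping, and compares the closed form with central finite differences for the example q = [0.5, 0.1, 0.1], p = [1, 0, 0]. The function names are made up for illustration.

```python
import math


def normalized_cross_entropy(p, q):
    """-sum_i p_i * log(q_i / sum_j q_j): cross entropy after renormalizing q."""
    total = sum(q)
    return -sum(pi * math.log(qi / total) for pi, qi in zip(p, q) if pi != 0)


def analytic_grad(p, q):
    """Closed form from the answer: sum(p) / sum(q) - p_i / q_i."""
    return [sum(p) / sum(q) - pi / qi for pi, qi in zip(p, q)]


def numeric_grad(p, q, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grads = []
    for i in range(len(q)):
        up, down = list(q), list(q)
        up[i] += h
        down[i] -= h
        diff = normalized_cross_entropy(p, up) - normalized_cross_entropy(p, down)
        grads.append(diff / (2 * h))
    return grads


p = [1.0, 0.0, 0.0]
q = [0.5, 0.1, 0.1]
grad_closed_form = analytic_grad(p, q)
grad_numeric = numeric_grad(p, q)
print(grad_closed_form)
print(grad_numeric)
```

Both lists come out close to the [-0.5714, 1.4286, 1.4286] reported by the tape. The shared term sum(p)/sum(q) comes from differentiating the normalizer, which is why entries with p_i == 0 are shifted as well.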


How does BatchNormalization work on an example?

I am trying to understand batchnorm.
My humble example
layer1 = tf.keras.layers.BatchNormalization(scale=False, center=False)
x = np.array([[3.,4.]])
out = layer1(x)
tf.Tensor([[2.99850112 3.9980015 ]], shape=(1, 2), dtype=float64)
My attempt to reproduce it
m = np.sum(x)/2
b = np.sum((x - m)**2)/2
It prints
[[-0.99800598 0.99800598]]
What am I doing wrong?
Two problems here.
First, batch norm has two "modes": training, where normalization is done via the batch statistics, and inference, where normalization is done via "population statistics" that are collected from batches during training. By default, keras layers/models run in inference mode, and you need to specify training=True in their call to change this (there are other ways, but that is the simplest one).
layer1 = tf.keras.layers.BatchNormalization(scale=False, center=False)
x = np.array([[3.,4.]], dtype=np.float32)
out = layer1(x, training=True)
This prints tf.Tensor([[0. 0.]], shape=(1, 2), dtype=float32). Still not right!
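Both numbers seen so far can be reproduced by hand. The sketch below assumes epsilon = 0.001 and a freshly created layer whose moving mean is 0 and moving variance is 1; these values are assumptions about the layer defaults, and the helper name is made up.

```python
import math

EPSILON = 1e-3  # assumption: default epsilon of the BatchNormalization layer


def normalize(value, mean, variance, eps=EPSILON):
    """Core batch norm formula with scale=False and center=False."""
    return (value - mean) / math.sqrt(variance + eps)


features = [3.0, 4.0]

# Inference mode: a fresh layer uses moving_mean=0 and moving_variance=1.
inference_out = [normalize(v, mean=0.0, variance=1.0) for v in features]

# Training mode with a batch of one: each feature is its own mean, variance is 0.
training_out = [normalize(v, mean=v, variance=0.0) for v in features]

print(inference_out)
print(training_out)
```

The first list is close to the 2.9985 and 3.998 from the question, and the second is exactly zero, which matches the training=True output for a batch of size one.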
Second, batch norm normalizes over the batch axis, separately for each feature. However, the way you specify the input (as a 1x2 array) is basically a single input (batch size 1) with two features. Batch norm just normalizes each feature to mean 0 (standard deviation is not defined). Instead, you want two inputs with a single feature:
layer1 = tf.keras.layers.BatchNormalization(scale=False, center=False)
x = np.array([[3.],[4.]], dtype=np.float32)
out = layer1(x, training=True)
This prints
tf.Tensor(
[[-0.99800634]
 [ 0.99800587]], shape=(2, 1), dtype=float32)
Alternatively, specify the "feature axis":
layer1 = tf.keras.layers.BatchNormalization(axis=0, scale=False, center=False)
x = np.array([[3.,4.]], dtype=np.float32)
out = layer1(x, training=True)
Note that the input shape is "wrong", but we told batchnorm that axis 0 is the feature axis (it defaults to -1, the last axis). This will also give the desired result:
tf.Tensor([[-0.99800634 0.99800587]], shape=(1, 2), dtype=float32)
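For reference, here is a pure-Python sketch of training-mode normalization over the batch axis, one feature column at a time. The epsilon of 0.001 is again an assumed default, and batch_norm_training is a hypothetical helper, not a library function.

```python
import math

EPSILON = 1e-3  # assumption: default epsilon of the BatchNormalization layer


def batch_norm_training(batch, eps=EPSILON):
    """Normalize each feature column of a list-of-rows batch with batch statistics."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(col) / n for col in columns]
    variances = [sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, means)]
    return [
        [(v - m) / math.sqrt(var + eps) for v, m, var in zip(row, means, variances)]
        for row in batch
    ]


two_samples = batch_norm_training([[3.0], [4.0]])  # shape (2, 1): two inputs, one feature
one_sample = batch_norm_training([[3.0, 4.0]])     # shape (1, 2): one input, two features

print(two_samples)
print(one_sample)
```

The (2, 1) input gives roughly -0.998 and 0.998, while the (1, 2) input collapses to zeros because each feature only ever sees a single value.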

Binary Logistic Regression - do we need to one_hot encode label?

I have a logistic regression model which I created by referring to this link
The label is a Boolean value (0 or 1 as values).
Do we need to one_hot encode the label in this case?
The reason for asking: I use the function below to compute the cross_entropy, and the loss always comes out as zero.
def cross_entropy(y_true, y_pred):
    y_true = tf.one_hot([y_true.numpy()], 2)
    loss_row = tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
    return tf.reduce_mean(loss_row)
EDIT :- The gradient is giving [None,None] as return value (for following code).
def grad(x, y):
    with tf.GradientTape() as tape:
        y_pred = logistic_regression(x)
        loss_val = cross_entropy(y, y_pred)
    return tape.gradient(loss_val, [w, b])
Example values
loss_val => tf.Tensor(307700.47, shape=(), dtype=float32)
w => tf.Variable 'Variable:0' shape=(171, 1) dtype=float32, numpy=
array([[ 0.7456649 ], [-0.35111237],[-0.6848465 ],[ 0.22605407]]
b => tf.Variable 'Variable:0' shape=(1,) dtype=float32, numpy=array([1.1982833], dtype=float32)
In the case of binary logistic regression, you don't need one_hot encoding. It is generally used in multinomial logistic regression.
If you are doing ordinary (binary) logistic regression (with 0/1 labels), then use the loss function tf.nn.sigmoid_cross_entropy_with_logits().
If you are doing multiclass logistic regression (a.k.a. softmax regression or multinomial logistic regression), then you have two choices:
Define your labels in 1-hot format (e.g. [1, 0, 0], [0, 1, 0], ...) and use the loss function tf.nn.softmax_cross_entropy_with_logits()
Define your labels as single integer class indices (e.g. 0, 1, 2, ...) and use the loss function tf.nn.sparse_softmax_cross_entropy_with_logits()
For the latter two, you can find more information in this StackOverflow question:
What's the difference between sparse_softmax_cross_entropy_with_logits and softmax_cross_entropy_with_logits?
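The relationship between the three losses can be sketched in plain Python. The functions below are hand-written stand-ins that only borrow the names of the TensorFlow ops. They use the standard formulas (the numerically stable sigmoid form max(x, 0) - x*z + log(1 + exp(-|x|)) and log-softmax), and the example numbers are arbitrary.

```python
import math


def sigmoid_cross_entropy_with_logits(label, logit):
    """Binary loss for a single 0/1 label and a single logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def softmax_cross_entropy_with_logits(one_hot, logits):
    """Multiclass loss for a one-hot label vector."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return -sum(p * (v - log_sum_exp) for p, v in zip(one_hot, logits))


def sparse_softmax_cross_entropy_with_logits(class_index, logits):
    """Same loss, but the label is an integer class index."""
    one_hot = [1.0 if i == class_index else 0.0 for i in range(len(logits))]
    return softmax_cross_entropy_with_logits(one_hot, logits)


# Binary case: one logit with a 0/1 label equals a two-class softmax on [0, logit].
binary_loss = sigmoid_cross_entropy_with_logits(label=1.0, logit=0.3)
two_class_loss = softmax_cross_entropy_with_logits([0.0, 1.0], [0.0, 0.3])

# Multiclass case: one-hot and integer labels give the same value.
logits = [2.0, 1.0, 0.1]
dense_loss = softmax_cross_entropy_with_logits([0.0, 0.0, 1.0], logits)
sparse_loss = sparse_softmax_cross_entropy_with_logits(2, logits)

print(binary_loss, two_class_loss)
print(dense_loss, sparse_loss)
```

So for a 0/1 label a single logit with the sigmoid loss carries the same information as a one-hot label over two logits, which is why the one_hot step is unnecessary in the binary case.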

Select weight of action from a tensorflow model

I have a small model used in a reinforcement learning context.
I can input a 2d tensor of states, and I get a 2d tensor of action weights.
Let's say I input two states and get the following action weights out:
[[0.1, 0.2],
[0.3, 0.4]]
Now I have another 2d tensor which have the action number from which I want to get the weights:
How can I use this tensor to get the weight of actions?
In this example I'd like to get:
This is similar to Tensorflow tf.gather with axis parameter, but the indices are handled a little differently here:
a = tf.constant([[0.1, 0.2], [0.3, 0.4]])
indices = tf.constant([[1], [0]])
# convert to full indices
full_indices = tf.stack([tf.range(indices.shape[0])[..., tf.newaxis], indices], axis=2)
# gather
result = tf.gather_nd(a, full_indices)
with tf.Session() as sess:
    print(sess.run(result))
A simple way to do this is to squeeze the dimensions of indices, multiply element-wise with the corresponding one-hot vector, and then expand the dimensions again afterwards.
import tensorflow as tf
weights = tf.constant([[0.1, 0.2], [0.3, 0.4]])
indices = tf.constant([[1], [0]])
# Reduce from 2d (2, 1) to 1d (2,)
indices1d = tf.squeeze(indices)
# One-hot vector corresponding to the indices. shape (2, 2)
action_one_hot = tf.one_hot(indices=indices1d, depth=weights.shape[1])
# Element-wise multiplication and sum across axis 1 to pick the weight. Shape (2,)
action_taken_weight = tf.reduce_sum(action_one_hot * weights, axis=1)
# Expand the dimension back to have a 2d. Shape (2, 1)
action_taken_weight2d = tf.expand_dims(action_taken_weight, axis=1)
sess = tf.InteractiveSession()
print("weights\n", sess.run(weights))
print("indices\n", sess.run(indices))
print("indices1d\n", sess.run(indices1d))
print("action_one_hot\n", sess.run(action_one_hot))
print("action_taken_weight\n", sess.run(action_taken_weight))
print("action_taken_weight2d\n", sess.run(action_taken_weight2d))
Should give you the following output:
[[0.1 0.2]
[0.3 0.4]]
[1 0]
[[0. 1.]
[1. 0.]]
[0.2 0.3]
Note: You can also do action_taken_weight = tf.reshape(action_taken_weight, tf.shape(indices)) instead of expand_dims.
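As a framework-free illustration, both approaches from this thread (building full (row, action) indices, and masking with a one-hot vector) can be written with plain lists. The function names are hypothetical. The point is only that the two strategies pick the same entries.

```python
weights = [[0.1, 0.2], [0.3, 0.4]]
indices = [[1], [0]]


def select_by_one_hot(weights, indices):
    """Mask each row with a one-hot vector and sum, keeping a (n, 1) shape."""
    picked = []
    for row, (action,) in zip(weights, indices):
        one_hot = [1.0 if j == action else 0.0 for j in range(len(row))]
        picked.append([sum(m * w for m, w in zip(one_hot, row))])
    return picked


def select_by_full_index(weights, indices):
    """Pair each action with its row number, then look the entries up directly."""
    full_indices = [(i, action) for i, (action,) in enumerate(indices)]
    return [[weights[i][j]] for i, j in full_indices]


by_one_hot = select_by_one_hot(weights, indices)
by_full_index = select_by_full_index(weights, indices)
print(by_one_hot)
print(by_full_index)
```

Both calls select 0.2 from the first row and 0.3 from the second.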

Tensorflow: Using Batch Normalization gives poor (erratic) validation loss and accuracy

I am trying to use Batch Normalization using tf.layers.batch_normalization() and my code looks like this:
def create_conv_exp_model(fingerprint_input, model_settings, is_training):
# Dropout placeholder
if is_training:
dropout_prob = tf.placeholder(tf.float32, name='dropout_prob')
# Mode placeholder
mode_placeholder = tf.placeholder(tf.bool, name="mode_placeholder")
he_init = tf.contrib.layers.variance_scaling_initializer(mode="FAN_AVG")
# Input Layer
input_frequency_size = model_settings['bins']
input_time_size = model_settings['spectrogram_length']
net = tf.reshape(fingerprint_input,
[-1, input_time_size, input_frequency_size, 1],
net = tf.layers.batch_normalization(net,
for i in range(1, 6):
net = tf.layers.conv2d(inputs=net,
kernel_size=[5, 5],
net = tf.layers.batch_normalization(net,
with tf.name_scope("relu_%d"%i):
net = tf.nn.relu(net)
net = tf.layers.max_pooling2d(net, [2, 2], [2, 2], 'SAME',
net_shape = net.get_shape().as_list()
net_height = net_shape[1]
net_width = net_shape[2]
net = tf.layers.conv2d( inputs=net,
kernel_size=[net_height, net_width],
strides=(net_height, net_width),
net = tf.layers.batch_normalization( net,
with tf.name_scope("relu_f"):
net = tf.nn.relu(net)
net = tf.layers.conv2d( inputs=net,
kernel_size=[1, 1],
### Squeeze
squeezed = tf.squeeze(net, axis=[1, 2], name="squeezed")
if is_training:
return squeezed, dropout_prob, mode_placeholder
return squeezed, mode_placeholder
And my train step looks like this:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_input)
    gvs = optimizer.compute_gradients(cross_entropy_mean)
    capped_gvs = [(tf.clip_by_value(grad, -2., 2.), var) for grad, var in gvs]
    train_step = optimizer.apply_gradients(capped_gvs)
During training, I am feeding the graph with:
train_summary, train_accuracy, cross_entropy_value, _, _ = sess.run(
merged_summaries, evaluation_step, cross_entropy_mean, train_step,
fingerprint_input: train_fingerprints,
ground_truth_input: train_ground_truth,
learning_rate_input: learning_rate_value,
dropout_prob: 0.5,
mode_placeholder: True
During validation,
validation_summary, validation_accuracy, conf_matrix = sess.run(
[merged_summaries, evaluation_step, confusion_matrix],
fingerprint_input: validation_fingerprints,
ground_truth_input: validation_ground_truth,
dropout_prob: 1.0,
mode_placeholder: False
My loss and accuracy curves (orange is training, blue is validation):
Plot of loss vs number of iterations,
Plot of accuracy vs number of iterations
The validation loss (and accuracy) seem very erratic. Is my implementation of Batch Normalization wrong? Or is this normal with Batch Normalization and I should wait for more iterations?
You need to pass is_training to tf.layers.batch_normalization(..., training=is_training) or it tries to normalize the inference minibatches using the minibatch statistics instead of the training statistics, which is wrong.
There are mainly two things to check.
1. Are you sure that you are using batch normalization (BN) correctly in the train op?
If you read the layer documentation:
Note: when training, the moving_mean and moving_variance need to be updated.
By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they
need to be added as a dependency to the train_op. Also, be sure to add
any batch_normalization ops before getting the update_ops collection.
Otherwise, update_ops will be empty, and training/inference will not work
For example:
x_norm = tf.layers.batch_normalization(x, training=training)
# ...
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)
2. Otherwise, try lowering the "momentum" in the BN.
During training, BN maintains two moving averages, one of the mean and one of the variance, that are supposed to approximate the population statistics. They are initialized to 0 and 1 respectively and then, step by step, they are multiplied by the momentum value (default is 0.99) and incremented by the new batch value times (1 - momentum), i.e. times 0.01. At inference (test) time, the normalization uses these statistics. For this reason, it takes these values a little while to arrive at the "real" mean and variance of the data.
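To get a feel for how slowly the moving averages move, here is a small simulation of the update rule described above. It makes the simplifying assumption that every batch reports the same mean (5.0, an arbitrary value) and counts how many updates are needed before the moving average has closed 95% of the initial gap. The helper names are made up.

```python
def update_moving_average(moving, batch_value, momentum):
    """One update: keep `momentum` of the old value and blend in the rest."""
    return moving * momentum + batch_value * (1.0 - momentum)


def steps_to_converge(target, momentum, start=0.0, tolerance=0.05):
    """Count updates until the remaining gap is below `tolerance` of the initial gap."""
    moving, steps = start, 0
    initial_gap = abs(target - start)
    while abs(target - moving) > tolerance * initial_gap:
        moving = update_moving_average(moving, target, momentum)
        steps += 1
    return steps


slow_steps = steps_to_converge(target=5.0, momentum=0.99)
fast_steps = steps_to_converge(target=5.0, momentum=0.9)
print(slow_steps, fast_steps)
```

With momentum 0.99 this takes on the order of a few hundred updates, against a few dozen with 0.9. This is one reason validation metrics can look erratic early in training.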
The original BN paper can be found here:
I also observed oscillations in validation loss when adding batch norm before ReLU. We found that moving the batch norm after the ReLU resolved the issue.

How to properly use tf.metrics.accuracy?

I have some trouble using the accuracy function from tf.metrics for a multi-class classification problem with logits as input.
My model output looks like:
logits = [[0.1, 0.5, 0.4],
[0.8, 0.1, 0.1],
[0.6, 0.3, 0.2]]
And my labels are one hot encoded vectors:
labels = [[0, 1, 0],
[1, 0, 0],
[0, 0, 1]]
When I try to do something like tf.metrics.accuracy(labels, logits) it never gives the correct result. I am obviously doing something wrong but I can't figure what it is.
The accuracy function tf.metrics.accuracy calculates how often predictions match labels. It is based on two local variables it creates, total and count, which are used to compute the frequency with which logits match labels.
acc, acc_op = tf.metrics.accuracy(labels=tf.argmax(labels, 1),
                                  predictions=tf.argmax(logits, 1))
print(sess.run([acc, acc_op]))
# Output
#[0.0, 0.66666669]
acc (accuracy): simply returns the metric computed from total and count; it doesn't update the metric.
acc_op (update op): updates the metric.
To understand why acc returns 0.0, go through the details below.
Details using a simple example:
logits = tf.placeholder(tf.int64, [2,3])
labels = tf.Variable([[0, 1, 0], [1, 0, 1]])
acc, acc_op = tf.metrics.accuracy(labels=tf.argmax(labels, 1),
                                  predictions=tf.argmax(logits, 1))
Initialize the variables:
Since metrics.accuracy creates two local variables, total and count, we need to call local_variables_initializer() to initialize them.
sess = tf.Session()
sess.run(tf.local_variables_initializer())
sess.run(tf.global_variables_initializer())
stream_vars = [i for i in tf.local_variables()]
print(stream_vars)
#[<tf.Variable 'accuracy/total:0' shape=() dtype=float32_ref>,
# <tf.Variable 'accuracy/count:0' shape=() dtype=float32_ref>]
Understanding update ops and accuracy calculation:
print('acc:',sess.run(acc, {logits:[[0,1,0],[1,0,1]]}))
#acc: 0.0
print('[total, count]:',sess.run(stream_vars))
#[total, count]: [0.0, 0.0]
The above returns 0.0 for accuracy because total and count are still zero, in spite of the matching inputs.
print('ops:', sess.run(acc_op, {logits:[[0,1,0],[1,0,1]]}))
#ops: 1.0
print('[total, count]:',sess.run(stream_vars))
#[total, count]: [2.0, 2.0]
With the new inputs, the accuracy is calculated when the update op is called. Note: since all the logits and labels match, we get an accuracy of 1.0, and the local variables total and count hold the number of correct predictions and the number of comparisons made, respectively.
Now we call accuracy with the new inputs (not the update ops):
print('acc:', sess.run(acc,{logits:[[1,0,0],[0,1,0]]}))
#acc: 1.0
The accuracy call doesn't update the metrics with the new inputs; it just returns the value computed from the two local variables. Note: the logits and labels don't match in this case. Now calling the update op again:
#op: 0.75
print('[total, count]:',sess.run(stream_vars))
#[total, count]: [3.0, 4.0]
The metrics are now updated with the new inputs.
More information on how to use the metrics during training and how to reset them during validation can be found here.
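The split between reading the metric and updating it can be mimicked with a tiny pure-Python class. StreamingAccuracy and argmax are hypothetical stand-ins. The second batch below is an assumption, picked so that one of its two predictions matches, and is not necessarily the input used in the walkthrough above.

```python
class StreamingAccuracy:
    """Keeps running totals, like the two local variables total and count."""

    def __init__(self):
        self.total = 0.0  # number of correct predictions so far
        self.count = 0.0  # number of comparisons made so far

    def result(self):
        # Reading the metric never changes the state.
        return self.total / self.count if self.count else 0.0

    def update(self, labels, predictions):
        # Updating accumulates the new batch and returns the refreshed value.
        self.total += sum(1.0 for l, p in zip(labels, predictions) if l == p)
        self.count += len(labels)
        return self.result()


def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


labels = [argmax(r) for r in [[0, 1, 0], [1, 0, 1]]]
metric = StreamingAccuracy()

before_update = metric.result()
first_update = metric.update(labels, [argmax(r) for r in [[0, 1, 0], [1, 0, 1]]])
read_only = metric.result()
second_update = metric.update(labels, [argmax(r) for r in [[0, 1, 0], [0, 1, 0]]])

print(before_update, first_update, read_only, second_update)
print(metric.total, metric.count)
```

Reading before any update gives 0.0, the first update gives 1.0, and the second update brings the running value to 0.75 with total = 3 and count = 4.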
On TF 2.0, if you are using the tf.keras API, you can define a custom class myAccuracy which inherits from tf.keras.metrics.Accuracy and overrides the update_state method like this:
# imports
# ...
class myAccuracy(tf.keras.metrics.Accuracy):
    def update_state(self, y_true, y_pred, sample_weight=None):
        y_true = tf.argmax(y_true, 1)
        y_pred = tf.argmax(y_pred, 1)
        return super(myAccuracy, self).update_state(y_true, y_pred, sample_weight)
Then, when compiling the model you can add metrics in the usual way.
from my_awesome_models import discriminador
from my_puzzling_datasets import train_dataset,test_dataset
# Train for 1 steps, validate for 1 steps
# 1/1 [==============================] - 3s 3s/step - loss: 0.1502 - accuracy: 0.9490 - val_loss: 0.1374 - val_accuracy: 0.9550
Or evaluate your model over the whole dataset
#> [0.131587415933609, 0.95354694]
Applied on a cnn you can write:
x = tf.placeholder(tf.float32, shape=[None, x_len], name='input')
fc1 = ... # cnn's fully connected layer
keep_prob = tf.placeholder(tf.float32, name='keep_prob')
layer_fc_dropout = tf.nn.dropout(fc1, keep_prob, name='dropout')
y_pred = tf.nn.softmax(fc1, name='output')
logits = tf.argmax(y_pred, axis=1)
y_true = tf.placeholder(tf.float32, shape=[None, y_len], name='y_true')
acc, acc_op = tf.metrics.accuracy(labels=tf.argmax(y_true, axis=1), predictions=tf.argmax(y_pred, 1))
def print_accuracy(x_data, y_data, dropout=1.0):
    accuracy = sess.run(acc_op, feed_dict={y_true: y_data, x: x_data, keep_prob: dropout})
    print('Accuracy: ', accuracy)
Extending the answer to TF2.0, the tutorial here explains clearly how to use tf.metrics for accuracy and loss.
Notice that it mentions that the metrics are reset after each epoch :
When label and predictions are one-hot-coded
def train_step(features, labels):
    with tf.GradientTape() as tape:
        predictions = model(features)
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=predictions))
    gradients = tape.gradient(loss, model.trainable_weights)
    optimizer.apply_gradients(zip(gradients, model.trainable_weights))
    train_accuracy(tf.argmax(labels, 1), tf.argmax(predictions, 1))
Here is how I use it:
test_accuracy = tf.keras.metrics.Accuracy()
# use dataset api or normal dataset from lists/np arrays
ds_test_batch = zip(x_test,y_test)
predicted_classes = np.array([])
for (x, y) in ds_test_batch:
    # training=False is needed only if there are layers with different
    # behaviour during training versus inference (e.g. Dropout).
    # Adjust the input so it matches the input used during training
    logits = model(x.reshape(1, -1), training=False)
    prediction = tf.argmax(logits, axis=1, output_type=tf.int64)
    predicted_classes = np.concatenate([predicted_classes, prediction.numpy()])
    test_accuracy(prediction, y)
print("Test set accuracy: {:.3%}".format(test_accuracy.result()))