I'm trying to rewrite a Keras graph into a Tensorflow graph, but wonder which loss function is the equivalent of "Binary Cross Entropy". Is it tf.nn.softmax_cross_entropy_with_logits_v2?
Thanks a lot!

No. The implementation of binary_crossentropy in the TensorFlow backend is defined as follows:

    def binary_crossentropy(target, output, from_logits=False):
        """Binary crossentropy between an output tensor and a target tensor.

        # Arguments
            target: A tensor with the same shape as `output`.
            output: A tensor.
            from_logits: Whether `output` is expected to be a logits tensor.
                By default, we consider that `output`
                encodes a probability distribution.

        # Returns
            A tensor.
        """
        # Note: nn.sigmoid_cross_entropy_with_logits
        # expects logits, Keras expects probabilities.
        if not from_logits:
            # transform back to logits
            epsilon_ = _to_tensor(epsilon(), output.dtype.base_dtype)
            output = clip_ops.clip_by_value(output, epsilon_, 1 - epsilon_)
            output = math_ops.log(output / (1 - output))
        return nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)

Therefore, it uses sigmoid cross-entropy (tf.nn.sigmoid_cross_entropy_with_logits), not softmax cross-entropy.
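
To make the equivalence concrete, here is a minimal framework-free sketch (plain Python, not the Keras or TensorFlow code itself). It assumes a clipping constant of 1e-7 and uses hypothetical helper names; it compares the textbook binary cross-entropy on a probability with the "convert back to a logit, then apply sigmoid cross-entropy" route shown above.

```python
import math

EPSILON = 1e-7  # assumed clipping constant, chosen for illustration


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_from_probability(target, prob):
    """Textbook binary cross-entropy on a probability."""
    prob = min(max(prob, EPSILON), 1.0 - EPSILON)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def sigmoid_ce_with_logits(target, logit):
    """Sigmoid cross-entropy on a logit: max(x, 0) - x*z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def bce_via_logit_roundtrip(target, prob):
    """Mimics the from_logits=False path: clip, invert the sigmoid, reuse the logit form."""
    prob = min(max(prob, EPSILON), 1.0 - EPSILON)
    logit = math.log(prob / (1.0 - prob))
    return sigmoid_ce_with_logits(target, logit)


logit = 1.5
prob = sigmoid(logit)
print(bce_from_probability(1.0, prob))
print(sigmoid_ce_with_logits(1.0, logit))
print(bce_via_logit_roundtrip(1.0, prob))
```

All three printed values should agree up to floating-point error, which is why a sigmoid output layer plus binary_crossentropy corresponds to the sigmoid (not softmax) cross-entropy op.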


Is it scaled twice in keras code categorical_crossentropy?

I see categorical_crossentropy is implemented in Keras as follows:

    def categorical_crossentropy(target, output, from_logits=False, axis=-1):
        """Categorical crossentropy between an output tensor and a target tensor.

        # Arguments
            target: A tensor of the same shape as `output`.
            output: A tensor resulting from a softmax
                (unless `from_logits` is True, in which
                case `output` is expected to be the logits).
            from_logits: Boolean, whether `output` is the
                result of a softmax, or is a tensor of logits.
            axis: Int specifying the channels axis. `axis=-1`
                corresponds to data format `channels_last`,
                and `axis=1` corresponds to data format
                `channels_first`.

        # Returns
            Output tensor.

        # Raises
            ValueError: if `axis` is neither -1 nor one of
                the axes of `output`.
        """
        output_dimensions = list(range(len(output.get_shape())))
        if axis != -1 and axis not in output_dimensions:
            raise ValueError(
                '{}{}{}'.format(
                    'Unexpected channels axis {}. '.format(axis),
                    'Expected to be -1 or one of the axes of `output`, ',
                    'which has {} dimensions.'.format(len(output.get_shape()))))
        # Note: tf.nn.softmax_cross_entropy_with_logits
        # expects logits, Keras expects probabilities.
        if not from_logits:
            # scale preds so that the class probas of each sample sum to 1
            output /= tf.reduce_sum(output, axis, True)
            # manual computation of crossentropy
            _epsilon = _to_tensor(epsilon(), output.dtype.base_dtype)
            output = tf.clip_by_value(output, _epsilon, 1. - _epsilon)
            return - tf.reduce_sum(target * tf.log(output), axis)

I don't understand these two lines:

    output_dimensions = list(range(len(output.get_shape())))
    output /= tf.reduce_sum(output, axis, True)

As I understand it, output already holds probabilities, i.e. a tensor resulting from a softmax, which means the predictions are already scaled so that the class probabilities of each sample sum to 1. Why do they need to scale the predictions again so that the class probabilities of each sample sum to 1? Please explain this.
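Because the cross-entropy computation is only correct if each sample's values form a valid probability distribution: every value must lie between 0 and 1, and the values must sum to 1. The rescaling is also a way to guard against user error, for cases where the caller passes unnormalized probabilities outside that range. If output really is the result of a softmax, the sum is already 1 and the division changes nothing.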
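
As a small illustration of that point, the following plain-Python sketch (the helper name normalize and the sample values are made up for this example) shows that dividing by the sum leaves a softmax output essentially unchanged, while non-negative scores that do not sum to 1 are turned into a valid distribution.

```python
def normalize(scores):
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]


softmax_output = [0.7, 0.2, 0.1]  # already sums to 1 (up to rounding)
raw_scores = [2.0, 1.0, 1.0]      # non-negative, but sums to 4

print(normalize(softmax_output))  # essentially unchanged
print(normalize(raw_scores))      # now a valid distribution
```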

Explanation of an implementation of the categorical_crossentropy

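The formula for the categorical cross-entropy is the following, where N is the number of samples, C the number of classes, y_i the class of sample i and p_{i,j} the predicted probability that sample i belongs to class j:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} \mathbf{1}[y_i = j] \, \log p_{i,j}$$
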
What should the output of the last layer be? Should it be the probabilities of classes from a softmax layer?
What is the target?
How does the following code implement 1/N, the summation and pi,j?

    def categorical_crossentropy(output, target, from_logits=False):
        """Categorical crossentropy between an output tensor and a target tensor.

        # Arguments
            output: A tensor resulting from a softmax
                (unless `from_logits` is True, in which
                case `output` is expected to be the logits).
            target: A tensor of the same shape as `output`.
            from_logits: Boolean, whether `output` is the
                result of a softmax, or is a tensor of logits.

        # Returns
            Output tensor.
        """
        # Note: tf.nn.softmax_cross_entropy_with_logits
        # expects logits, Keras expects probabilities.
        if not from_logits:
            # scale preds so that the class probas of each sample sum to 1
            output /= tf.reduce_sum(output,
                                    reduction_indices=len(output.get_shape()) - 1,
                                    keep_dims=True)
            # manual computation of crossentropy
            epsilon = _to_tensor(_EPSILON, output.dtype.base_dtype)
            output = tf.clip_by_value(output, epsilon, 1. - epsilon)
            return - tf.reduce_sum(target * tf.log(output),
                                   reduction_indices=len(output.get_shape()) - 1)
        else:
            return tf.nn.softmax_cross_entropy_with_logits(labels=target,
                                                           logits=output)

What should the output of the last layer be? Should it be the probabilities of classes from a softmax layer?
It can be either the output of the softmax layer or the raw logits (the input to the softmax layer). The output vector of the softmax layer holds the probabilities of each class. If output is the output of a softmax, set from_logits=False. If output holds the logits, set from_logits=True. In that case Keras internally calls tf.nn.softmax_cross_entropy_with_logits, which computes the softmax probabilities and the cross-entropy at the same time. Computing them together allows for some math tricks that improve numerical stability.
What is the target?
The target is a one-hot vector. This means that a number n is represented by a vector v where v[n] = 1 and every other entry is 0. Here n is the class of the label. TensorFlow provides a function for this encoding called tf.one_hot. Indices are zero-based, so tf.one_hot([3], 5) results in [[0, 0, 0, 1, 0]].
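
The encoding itself is simple enough to sketch in plain Python. The function below is a hypothetical stand-in written for illustration, not TensorFlow's implementation; like tf.one_hot, it treats class indices as zero-based.

```python
def one_hot(index, depth):
    """Return a list of length `depth` with a 1.0 at position `index` (zero-based)."""
    return [1.0 if j == index else 0.0 for j in range(depth)]


print(one_hot(3, 5))
```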
How does the following code implement 1/N, the summation and pi,j?
The code above does not average over all the inputs, so it does not implement the "1/N". For example, if the input is shaped [10, 5], the output is shaped [10], i.e. one loss value per sample. You would have to call tf.reduce_mean on the result to get the average. So the equation computed for sample i is essentially:

$$L_i = -\sum_{j} \mathbf{1}[y_i = j] \, \log p_{i,j}$$

The above equation is implemented in the line

    return - tf.reduce_sum(target * tf.log(output),
                           reduction_indices=len(output.get_shape()) - 1)

The "Σ" is tf.reduce_sum, "p_{i,j}" is output, and the indicator function (i.e. the bolded 1) is the one-hot encoded target.
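
The split between the per-sample sum and the final average can be sketched without TensorFlow. This is a minimal plain-Python illustration with made-up sample values and an assumed clipping constant of 1e-7; the function name mirrors the Keras one but is only a stand-in.

```python
import math

EPSILON = 1e-7  # assumed clipping constant to avoid log(0)


def categorical_crossentropy(targets, outputs):
    """Per-sample loss: -sum_j target[i][j] * log(output[i][j]); no 1/N here."""
    return [
        -sum(t * math.log(max(p, EPSILON)) for t, p in zip(t_row, p_row))
        for t_row, p_row in zip(targets, outputs)
    ]


targets = [[0, 1, 0], [1, 0, 0]]              # one-hot labels, shape [2, 3]
outputs = [[0.2, 0.7, 0.1], [0.5, 0.3, 0.2]]  # predicted probabilities

per_sample = categorical_crossentropy(targets, outputs)  # one value per sample
mean_loss = sum(per_sample) / len(per_sample)            # this step is the 1/N
print(per_sample, mean_loss)
```

The one-hot target zeroes out every term except the one for the true class, so each per-sample value reduces to minus the log of the probability assigned to the correct class.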
Side Note
You should use tf.nn.softmax_cross_entropy_with_logits_v2, because the code you provided (when setting from_logits=False) could result in numerical errors. The combined function takes care of all of those numerical issues.
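
One such numerical issue can be shown in plain Python. The sketch below is an illustration of the log-sum-exp trick that fused implementations commonly rely on, not TensorFlow's actual code; the function names are hypothetical.

```python
import math


def naive_softmax_cross_entropy(target, logits):
    """Softmax first, then log: exp() overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return -sum(t * math.log(e / total) for t, e in zip(target, exps))


def stable_softmax_cross_entropy(target, logits):
    """Uses log softmax_j = z_j - logsumexp(z), shifting by max(z) first."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_sum_exp) for t, z in zip(target, logits))


target = [0.0, 1.0]
logits = [1000.0, 0.0]

print(stable_softmax_cross_entropy(target, logits))
try:
    naive_softmax_cross_entropy(target, logits)
except OverflowError as err:
    print("naive version failed:", err)
```

For moderate logits the two functions agree; the difference only shows up when the logits are large in magnitude.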

Using binary_crossentropy loss in Keras (Tensorflow backend)

In the training example in the Keras documentation, binary_crossentropy is used and a sigmoid activation is added as the network's last layer. But is it necessary to add a sigmoid as the last layer? As I found in the source code:

    def binary_crossentropy(output, target, from_logits=False):
        """Binary crossentropy between an output tensor and a target tensor.

        # Arguments
            output: A tensor.
            target: A tensor with the same shape as `output`.
            from_logits: Whether `output` is expected to be a logits tensor.
                By default, we consider that `output`
                encodes a probability distribution.

        # Returns
            A tensor.
        """
        # Note: nn.softmax_cross_entropy_with_logits
        # expects logits, Keras expects probabilities.
        if not from_logits:
            # transform back to logits
            epsilon = _to_tensor(_EPSILON, output.dtype.base_dtype)
            output = clip_ops.clip_by_value(output, epsilon, 1 - epsilon)
            output = math_ops.log(output / (1 - output))
        return nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)

Keras invokes sigmoid_cross_entropy_with_logits in TensorFlow, but inside the sigmoid_cross_entropy_with_logits function, sigmoid(logits) is calculated again.
So I don't think it makes sense to add a sigmoid at the end, yet seemingly all the binary/multi-label classification examples and tutorials for Keras that I found online add a sigmoid as the last layer. Besides, I don't understand the meaning of this comment:

    # Note: nn.softmax_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.

Why does Keras expect probabilities? Doesn't it use the nn.softmax_cross_entropy_with_logits function? Does that make sense?
You're right, that's exactly what's happening. I believe this is due to historical reasons.
Keras was created before TensorFlow, as a wrapper around Theano. In Theano, one has to compute the sigmoid/softmax manually and then apply the cross-entropy loss function. TensorFlow does everything in one fused op, but the API with a sigmoid/softmax layer had already been adopted by the community.
If you want to avoid unnecessary logit <-> probability conversions, call the binary_crossentropy loss with from_logits=True and don't add the sigmoid layer.
In categorical cross-entropy:
    if the output is a prediction (probabilities), the cross-entropy is computed directly;
    if the output is a logit, softmax_cross_entropy_with_logits is applied.
In binary cross-entropy:
    if the output is a prediction (probabilities), it is converted back to a logit and sigmoid_cross_entropy_with_logits is then applied;
    if the output is a logit, sigmoid_cross_entropy_with_logits is applied directly.
In Keras, by default, we use a sigmoid activation on the output layer and then the Keras binary_crossentropy loss function, independent of the backend implementation (Theano, TensorFlow or CNTK).
If you look more closely at the pure TensorFlow case, you find that the TensorFlow backend binary_crossentropy function (which you pasted in your question) uses tf.nn.sigmoid_cross_entropy_with_logits. The latter function also applies the sigmoid activation. To avoid a double sigmoid, the TensorFlow backend binary_crossentropy will by default (with from_logits=False) calculate the inverse sigmoid (logit(x) = log(x / (1 - x))) to get the output back into the raw state from the network with no activation.
The extra sigmoid activation and the inverse sigmoid calculation can be avoided by using no sigmoid activation function in your last layer and then calling the TensorFlow backend binary_crossentropy with the parameter from_logits=True (or by using tf.nn.sigmoid_cross_entropy_with_logits directly).
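
Besides the wasted work, the sigmoid followed by the clipped inverse sigmoid is not a perfect round trip. The plain-Python sketch below illustrates this under an assumed clipping constant of 1e-7 (the helper names are hypothetical): moderate logits are recovered, but a saturated one is capped by the clip.

```python
import math

EPSILON = 1e-7  # assumed clipping constant, as in the from_logits=False path


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def recovered_logit(prob):
    """Clip the probability, then invert the sigmoid: log(p / (1 - p))."""
    prob = min(max(prob, EPSILON), 1.0 - EPSILON)
    return math.log(prob / (1.0 - prob))


for logit in (2.0, 30.0):
    print(logit, recovered_logit(sigmoid(logit)))
```

With from_logits=True and no sigmoid layer, the network's raw output reaches the loss unchanged, so this capping does not occur.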

Can you process a tensor in chunks in a custom Keras loss function?

I am trying to write a custom Keras loss function in which I process the tensors in sub-vector chunks. For example, if an output tensor represented a concatenation of quaternion coefficients (i.e. w,x,y,z,w,x,y,z...), I might wish to normalize each quaternion before calculating the mean squared error in a loss function like:

    def norm_quat_mse(y_true, y_pred):
        diff = y_pred - y_true
        dist = 0
        for i in range(0,16,4):
            dist += K.sum( K.square(diff[i:i+4] / K.sqrt(K.sum(K.square(diff[i:i+4])))))
        return dist/4

While Keras accepts this function without error and uses it in training, the loss value it reports differs from the one I get when I apply the function independently to the output of model.predict(), so I suspect it is not working properly. None of the built-in Keras loss functions use this per-chunk processing approach. Is it possible to do this within Keras' auto-differentiation framework?

    from keras import backend as K

    def norm_quat_mse(y_true, y_pred):
        diff = y_pred - y_true
        dist = 0
        for i in range(0,16,4):
            # slice along the second (feature) axis; the first axis is the batch
            dist += K.sum( K.square(diff[:,i:i+4] / K.sqrt(K.sum(K.square(diff[:,i:i+4])))))
        return dist/4

Note that the shape of y_true and y_pred is (batch_size, output_size), so you need to skip the first (batch) dimension when slicing during the computation.
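
To check such a loss outside of Keras, it can help to have a reference computation on plain lists. The sketch below is one possible reading of the question, not the answer's code: it assumes the goal is to normalize every 4-element quaternion of both the prediction and the target, row by row, before taking the mean squared error. The helper names are hypothetical and zero-norm quaternions are not handled.

```python
import math


def normalize_quaternions(row, chunk=4):
    """Scale each consecutive `chunk`-sized group in `row` to unit length."""
    out = []
    for i in range(0, len(row), chunk):
        q = row[i:i + chunk]
        norm = math.sqrt(sum(c * c for c in q))
        out.extend(c / norm for c in q)
    return out


def norm_quat_mse(y_true, y_pred, chunk=4):
    """Mean over the batch of the MSE between chunk-normalized rows."""
    total = 0.0
    for t_row, p_row in zip(y_true, y_pred):
        t_n = normalize_quaternions(t_row, chunk)
        p_n = normalize_quaternions(p_row, chunk)
        total += sum((p - t) ** 2 for t, p in zip(t_n, p_n)) / len(t_row)
    return total / len(y_true)


y_true = [[1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]]  # batch of 1, two quaternions
y_pred = [[2.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0]]  # same directions, other scale
print(norm_quat_mse(y_true, y_pred))
```

Because each row is handled separately, the batch dimension never mixes into the per-quaternion norm. In a Keras loss the same idea would need tensor operations along the last axis instead of Python loops over samples.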

tensorflow tutorial of convolution, scale of logit

I am trying to edit my own model by adding some code to cifar10.py and here is the question.
In cifar10.py, the tutorial says:
EXERCISE: The output of inference are un-normalized logits. Try editing the network architecture to return normalized predictions using tf.nn.softmax().
So I directly input the output from "local4" to tf.nn.softmax(). This gives me the scaled logits which means the sum of all logits is 1.
But in the loss function, the cifar10.py code uses:
and description of this function says
WARNING: This op expects unscaled logits, since it performs a softmax on logits internally for efficiency. Do not call this op with the output of softmax, as it will produce incorrect results.
Also, according to the description, the logits passed to the above function must have the shape [batch_size, num_classes], which I take to mean that they should be the unscaled inputs to the softmax, as in the sample code, which calculates the unnormalized logits as follows:

    # softmax, i.e. softmax(WX + b)
    with tf.variable_scope('softmax_linear') as scope:
        weights = _variable_with_weight_decay('weights', [192, NUM_CLASSES],
                                              stddev=1/192.0, wd=0.0)
        biases = _variable_on_cpu('biases', [NUM_CLASSES],
                                  tf.constant_initializer(0.0))
        softmax_linear = tf.add(tf.matmul(local4, weights), biases, name=scope.name)

Does this mean I don't have to use tf.nn.softmax in the code?
You can use tf.nn.softmax in the code if you want, but then you will have to compute the loss yourself:

    softmax_probs = tf.nn.softmax(logits)
    # labels must be one-hot encoded, with the same shape as logits
    loss = tf.reduce_mean(-tf.reduce_sum(labels * tf.log(softmax_probs), axis=1))

In practice, you don't use tf.nn.softmax for computing the loss, because the loss op already applies the softmax internally. However, you do need tf.nn.softmax if, for instance, you want to compute the predictions of your model and compare them to the true labels (to compute accuracy).
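
The warning quoted in the question can be reproduced with a few lines of plain Python. This is an illustrative sketch with made-up logits and hypothetical helper names, not the TensorFlow op: it shows that feeding softmax output into a loss that applies softmax internally changes the loss value, while the predicted class is unaffected.

```python
import math


def softmax(logits):
    m = max(logits)  # shift by the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(label, logits):
    """Loss for one sample when softmax is applied inside the loss (label = class index)."""
    return -math.log(softmax(logits)[label])


logits = [2.0, 1.0, 0.1]
label = 0

correct_loss = cross_entropy_from_logits(label, logits)
# Passing probabilities where logits are expected applies softmax twice.
double_softmax_loss = cross_entropy_from_logits(label, softmax(logits))

probs = softmax(logits)
prediction = max(range(len(probs)), key=lambda j: probs[j])  # argmax for accuracy

print(correct_loss, double_softmax_loss, prediction)
```

The second softmax flattens the distribution, so the loss computed from it no longer matches the true cross-entropy, which is why the raw logits should go into the loss and the softmax output should only be used for predictions.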