Explanation of an implementation of the categorical_crossentropy - tensorflow

The formula for the categorical cross-entropy is the following (N samples, M classes, p_{i,j} the predicted probability that sample i belongs to class j, and a bold indicator 1 that equals 1 only when sample i actually belongs to class j):

J = -(1/N) * Σ_{i=1..N} Σ_{j=1..M} 1[sample i in class j] * log(p_{i,j})
What should the output of the last layer be? Should it be the probabilities of classes from a softmax layer?
What is the target?
How does the following code implement 1/N, the summation and p_{i,j}?
def categorical_crossentropy(output, target, from_logits=False):
    """Categorical crossentropy between an output tensor and a target tensor.

    # Arguments
        output: A tensor resulting from a softmax
            (unless `from_logits` is True, in which
            case `output` is expected to be the logits).
        target: A tensor of the same shape as `output`.
        from_logits: Boolean, whether `output` is the
            result of a softmax, or is a tensor of logits.

    # Returns
        Output tensor.
    """
    # Note: tf.nn.softmax_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        # scale preds so that the class probas of each sample sum to 1
        output /= tf.reduce_sum(output,
                                reduction_indices=len(output.get_shape()) - 1,
                                keep_dims=True)
        # manual computation of crossentropy
        epsilon = _to_tensor(_EPSILON, output.dtype.base_dtype)
        output = tf.clip_by_value(output, epsilon, 1. - epsilon)
        return - tf.reduce_sum(target * tf.log(output),
                               reduction_indices=len(output.get_shape()) - 1)
    else:
        return tf.nn.softmax_cross_entropy_with_logits(labels=target,
                                                       logits=output)

What should the output of the last layer be? Should it be the probabilities of classes from a softmax layer?
It can be either the output of the softmax layer or the raw logits (the input to the softmax layer). The output vector of the softmax layer contains the probabilities of each class. If output is the output of a softmax, set from_logits=False; if output holds the logits, set from_logits=True. You can see that internally tf.nn.softmax_cross_entropy_with_logits is called, which computes the softmax and the cross-entropy at the same time; combining them allows for some mathematical tricks that improve numerical stability.
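As a hedged sketch (TF 1.x, made-up numbers), both ways of calling end up computing the same value; the first path mirrors the from_logits=False branch above, the second the from_logits=True branch:

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])   # raw scores from the last layer (made-up values)
target = tf.constant([[1.0, 0.0, 0.0]])   # one-hot label

# from_logits=False path: pass probabilities and compute the cross-entropy manually
probs = tf.nn.softmax(logits)
loss_from_probs = -tf.reduce_sum(target * tf.log(probs), axis=-1)

# from_logits=True path: pass the raw logits to the fused op
loss_from_logits = tf.nn.softmax_cross_entropy_with_logits(labels=target, logits=logits)

with tf.Session() as sess:
    print(sess.run([loss_from_probs, loss_from_logits]))   # both roughly [0.417]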
What is the target?
The target is a one-hot vector. This means that a number n is represented by a vector v where v[n] = 1 and 0 everywhere else; here n is the class of the label. There is a function to get this encoding in TensorFlow called tf.one_hot. For example, tf.one_hot(3, 5) results in the vector [0, 0, 0, 1, 0].
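For a whole batch of integer labels, the same call produces one one-hot row per sample; a small sketch (TF 1.x, made-up labels):

import tensorflow as tf

labels = tf.constant([3, 0, 1])          # integer class ids for a batch of 3 samples
target = tf.one_hot(labels, depth=5)     # shape [3, 5], one one-hot row per sample

with tf.Session() as sess:
    print(sess.run(target))
    # [[0. 0. 0. 1. 0.]
    #  [1. 0. 0. 0. 0.]
    #  [0. 1. 0. 0. 0.]]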
How does the following code implement 1/N, the summation and p_{i,j}?
The code above does not average over all the inputs, so there is no "1/N". For example, if the input is shaped [10, 5] the output is shaped [10]; you would have to call tf.reduce_mean on the result yourself. So the per-sample equation is essentially

loss_i = - Σ_{j=1..M} 1[sample i in class j] * log(p_{i,j})

and it is implemented in the line
return - tf.reduce_sum(target * tf.log(output),
                       reduction_indices=len(output.get_shape()) - 1)
The "Σ" is tf.reduce_sum. "pi,j" is output, the indicator function (i.e. the bolded 1) is the one-hot encoded target.
Side Note
You should use tf.nn.softmax_cross_entropy_with_logits_v2, because the code you provided (when setting from_logits=False) can result in numerical errors. The combined function takes care of those numerical issues.
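To illustrate the kind of numerical error meant here, a hedged sketch (TF 1.x, deliberately extreme made-up logits): the naive softmax-then-log path breaks down, while the fused op returns the exact value. The clipping in the Keras code above avoids the NaN, but then caps the loss at -log(epsilon) ≈ 16.1 instead of the true 200.

import tensorflow as tf

logits = tf.constant([[200.0, 0.0, 0.0]])    # extreme but legal logits (made up)
target = tf.constant([[0.0, 1.0, 0.0]])      # true class has probability ~exp(-200)

probs = tf.nn.softmax(logits)                             # underflows to [1., 0., 0.]
naive = -tf.reduce_sum(target * tf.log(probs), axis=-1)   # log(0) poisons the result
fused = tf.nn.softmax_cross_entropy_with_logits_v2(labels=target, logits=logits)

with tf.Session() as sess:
    print(sess.run([naive, fused]))   # roughly [nan] and [200.]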

Related

Is it scaled twice in keras code categorical_crossentropy?

I see categorical_crossentropy is implemented in Keras as follows:
def categorical_crossentropy(target, output, from_logits=False, axis=-1):
    """Categorical crossentropy between an output tensor and a target tensor.

    # Arguments
        target: A tensor of the same shape as `output`.
        output: A tensor resulting from a softmax
            (unless `from_logits` is True, in which
            case `output` is expected to be the logits).
        from_logits: Boolean, whether `output` is the
            result of a softmax, or is a tensor of logits.
        axis: Int specifying the channels axis. `axis=-1`
            corresponds to data format `channels_last`,
            and `axis=1` corresponds to data format
            `channels_first`.

    # Returns
        Output tensor.

    # Raises
        ValueError: if `axis` is neither -1 nor one of
            the axes of `output`.
    """
    output_dimensions = list(range(len(output.get_shape())))
    if axis != -1 and axis not in output_dimensions:
        raise ValueError(
            '{}{}{}'.format(
                'Unexpected channels axis {}. '.format(axis),
                'Expected to be -1 or one of the axes of `output`, ',
                'which has {} dimensions.'.format(len(output.get_shape()))))
    # Note: tf.nn.softmax_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        # scale preds so that the class probas of each sample sum to 1
        output /= tf.reduce_sum(output, axis, True)
        # manual computation of crossentropy
        _epsilon = _to_tensor(epsilon(), output.dtype.base_dtype)
        output = tf.clip_by_value(output, _epsilon, 1. - _epsilon)
        return - tf.reduce_sum(target * tf.log(output), axis)
I don't understand the part from
output_dimensions = list(range(len(output.get_shape())))
to
output /= tf.reduce_sum(output, axis, True)
I understand that output is a tensor of probabilities resulting from a softmax, which means the class probabilities of each sample already sum to 1. Why do they need to scale the preds so that the class probabilities of each sample sum to 1 again? Please explain this.
Because you need to make sure that each probability is between 0 and 1, otherwise the cross-entropy computation will be incorrect. It is also a way to prevent user errors when they pass (unnormalized) probabilities outside that range.
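A small sketch of what that division does (TF 1.x, made-up values): rows that already sum to 1 pass through unchanged, anything else gets rescaled.

import tensorflow as tf

output = tf.constant([[2.0, 1.0, 1.0],    # does not sum to 1 (e.g. a user error)
                      [0.2, 0.3, 0.5]])   # already a proper probability distribution

# same operation as in the Keras code: divide each row by its own sum
output = output / tf.reduce_sum(output, -1, True)

with tf.Session() as sess:
    print(sess.run(output))
    # [[0.5  0.25 0.25]
    #  [0.2  0.3  0.5 ]]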

What is the Tensorflow loss equivalent of "Binary Cross Entropy"?

I'm trying to rewrite a Keras graph into a Tensorflow graph, but wonder which loss function is the equivalent of "Binary Cross Entropy". Is it tf.nn.softmax_cross_entropy_with_logits_v2?
Thanks a lot!
No, the implementation of binary_crossentropy with the TensorFlow backend is defined as follows:
@tf_export('keras.backend.binary_crossentropy')
def binary_crossentropy(target, output, from_logits=False):
    """Binary crossentropy between an output tensor and a target tensor.

    Arguments:
        target: A tensor with the same shape as `output`.
        output: A tensor.
        from_logits: Whether `output` is expected to be a logits tensor.
            By default, we consider that `output`
            encodes a probability distribution.

    Returns:
        A tensor.
    """
    # Note: nn.sigmoid_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        # transform back to logits
        epsilon_ = _to_tensor(epsilon(), output.dtype.base_dtype)
        output = clip_ops.clip_by_value(output, epsilon_, 1 - epsilon_)
        output = math_ops.log(output / (1 - output))
    return nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)
Therefore, it uses sigmoid cross-entropy (tf.nn.sigmoid_cross_entropy_with_logits) and not softmax cross-entropy.
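So the closest pure-TensorFlow equivalent is tf.nn.sigmoid_cross_entropy_with_logits applied to the raw logits; a minimal sketch (TF 1.x, made-up tensors):

import tensorflow as tf

logits = tf.constant([[0.3], [-1.2], [2.5]])   # raw outputs of a 1-unit last layer (made up)
labels = tf.constant([[1.0], [0.0], [1.0]])    # binary targets

per_element = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
loss = tf.reduce_mean(per_element)             # reduce to a scalar training loss

with tf.Session() as sess:
    print(sess.run(loss))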

Why does TensorFlow use a 'dim' parameter for the softmax function?

Why does TensorFlow use a 'dim' parameter for the softmax function? What kinds of tensors can we use as input?
tf.nn.softmax accepts any non-empty tensor as input.
You can choose to apply the softmax along whichever dimension you want.
Usually, the softmax is applied to the last dimension of the input tensor (that's the default behavior), because the typical use case is the output of a neural network, which is usually a tensor of shape [batch_size, num_classes].
However, you could decide to apply the softmax to a tensor of shape [batch_size, num_classes, 2, 1] and compute the softmax only over the second dimension of the tensor: tf.nn.softmax(tensor, axis=1) (the parameter was called dim before being renamed to axis).
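A short sketch of both cases (TF 1.x; on releases before the rename you would pass dim=1 instead of axis=1):

import tensorflow as tf

# Typical case: [batch_size, num_classes]; softmax over the last axis (the default)
scores = tf.constant([[1.0, 2.0, 3.0],
                      [1.0, 1.0, 1.0]])
probs = tf.nn.softmax(scores)                  # each row sums to 1

# Less common case: [batch_size, num_classes, 2, 1]; softmax over axis 1 only
scores4d = tf.random_normal([8, 5, 2, 1])
probs4d = tf.nn.softmax(scores4d, axis=1)      # dim=1 on older TF 1.x versions

with tf.Session() as sess:
    print(sess.run(tf.reduce_sum(probs, axis=-1)))      # [1. 1.]
    print(sess.run(tf.reduce_sum(probs4d, axis=1)))     # an [8, 2, 1] tensor of ones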

LSTM Followed by Mean Pooling (TensorFlow)

I am aware that there is a similar topic at LSTM Followed by Mean Pooling, but that is about Keras and I work in pure TensorFlow.
I have an LSTM network where the recurrence is handled by:
outputs, final_state = tf.nn.dynamic_rnn(cell,
                                         embed,
                                         sequence_length=seq_lengths,
                                         initial_state=initial_state)
where I pass the correct sequence lengths for each sample (padding by zeros). In any case, outputs contains irrelevant outputs since some samples produce longer outputs than others, based on sequence lengths.
Right now I'm extracting the last relevant output by means of the following method:
def extract_axis_1(data, ind):
    """
    Get specified elements along the first axis of tensor.
    :param data: Tensorflow tensor that will be subsetted.
    :param ind: Indices to take (one for each element along axis 0 of data).
    :return: Subsetted tensor.
    """
    batch_range = tf.range(tf.shape(data)[0])
    indices = tf.stack([batch_range, ind], axis=1)
    res = tf.reduce_mean(tf.gather_nd(data, indices), axis=0)
    return res
where I pass sequence_length - 1 as indices. In reference to the last topic, I would like to select all relevant outputs followed by average pooling, instead of just the last one.
Now, I tried passing nested lists as indices to extract_axis_1, but tf.stack does not accept this.
Any solution directions for this?
You can exploit the weights parameter of the tf.contrib.seq2seq.sequence_loss function.
From the documentation:
weights: A Tensor of shape [batch_size, sequence_length] and dtype float. weights constitutes the weighting of each prediction in the sequence. When using weights as masking, set all valid timesteps to 1 and all padded timesteps to 0, e.g. a mask returned by tf.sequence_mask.
You need to compute a binary mask that distinguishes between your valid outputs and the invalid ones. Then you can just provide this mask to the weights parameter of the loss function (probably you will want to use a loss like this one); the function will not consider the outputs with a weight of 0 in the computation of the loss.
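A hedged sketch of that approach (TF 1.x with tf.contrib; the sizes, logits and targets below are made up, and seq_lengths plays the same role as in your dynamic_rnn call):

import tensorflow as tf

batch_size, max_time, num_classes = 4, 7, 10
logits = tf.random_normal([batch_size, max_time, num_classes])    # per-timestep predictions
targets = tf.zeros([batch_size, max_time], dtype=tf.int32)        # per-timestep labels
seq_lengths = tf.constant([7, 3, 5, 2])                           # true lengths per sample

# 1.0 for valid timesteps, 0.0 for padded ones
weights = tf.sequence_mask(seq_lengths, maxlen=max_time, dtype=tf.float32)

loss = tf.contrib.seq2seq.sequence_loss(logits=logits,
                                        targets=targets,
                                        weights=weights)   # padded steps are ignored

with tf.Session() as sess:
    print(sess.run(loss))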
If you can't or don't need to use a sequence loss, you can do exactly the same thing manually: compute a binary mask, multiply your outputs by this mask, and provide the result as input to your fully connected layer.
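And for the average-pooling goal itself, a minimal sketch of that manual route (TF 1.x), assuming outputs has shape [batch_size, max_time, hidden_size] and seq_lengths is the length tensor you already pass to dynamic_rnn: mask the padded steps, sum over time, and divide by each sample's true length.

import tensorflow as tf

def masked_mean_pool(outputs, seq_lengths):
    """Average RNN outputs over the valid timesteps only.

    outputs:     [batch_size, max_time, hidden_size]
    seq_lengths: [batch_size] int tensor of true sequence lengths
    returns:     [batch_size, hidden_size]
    """
    max_time = tf.shape(outputs)[1]
    mask = tf.sequence_mask(seq_lengths, maxlen=max_time, dtype=tf.float32)  # [B, T]
    mask = tf.expand_dims(mask, axis=-1)                                     # [B, T, 1]
    summed = tf.reduce_sum(outputs * mask, axis=1)                           # [B, H]
    lengths = tf.cast(tf.expand_dims(seq_lengths, -1), tf.float32)           # [B, 1]
    return summed / lengths

# e.g. pooled = masked_mean_pool(outputs, seq_lengths), then feed `pooled`
# to the fully connected layer instead of the last relevant output.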

tensorflow tutorial of convolution, scale of logit

I am trying to edit my own model by adding some code to cifar10.py and here is the question.
In cifar10.py, the tutorial says:
EXERCISE: The output of inference are un-normalized logits. Try editing the network architecture to return normalized predictions using tf.nn.softmax().
So I directly feed the output of "local4" into tf.nn.softmax(). This gives me the scaled logits, which means the sum over all classes is 1.
But in the loss function, the cifar10.py code uses:
tf.nn.sparse_softmax_cross_entropy_with_logits()
and description of this function says
WARNING: This op expects unscaled logits, since it performs a softmax on logits internally for efficiency. Do not call this op with the output of softmax, as it will produce incorrect results.
Also, according to the description, the logits passed to the above function must have the shape [batch_size, num_classes], and it means they should be the unscaled (unnormalized) values, as the sample code computes them below:
# softmax, i.e. softmax(WX + b)
with tf.variable_scope('softmax_linear') as scope:
    weights = _variable_with_weight_decay('weights', [192, NUM_CLASSES],
                                          stddev=1/192.0, wd=0.0)
    biases = _variable_on_cpu('biases', [NUM_CLASSES],
                              tf.constant_initializer(0.0))
    softmax_linear = tf.add(tf.matmul(local4, weights), biases, name=scope.name)
    _activation_summary(softmax_linear)
Does this mean I don't have to use tf.nn.softmax in the code?
You can use tf.nn.softmax in the code if you want, but then you will have to compute the loss yourself:
softmax_logits = tf.nn.softmax(logits)
loss = tf.reduce_mean(- labels * tf.log(softmax_logits) - (1. - labels) * tf.log(1. - softmax_logits))
In practice, you don't use tf.nn.softmax for computing the loss. However, you do need tf.nn.softmax if, for instance, you want to compute the predictions of your model and compare them to the true labels (to compute accuracy).
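In other words, a typical split keeps the unscaled logits for the loss and uses softmax (or just argmax) only on the prediction side; a hedged sketch (TF 1.x, made-up tensors in place of the cifar10 graph):

import tensorflow as tf

logits = tf.random_normal([128, 10])       # stands in for the unscaled softmax_linear output
labels = tf.zeros([128], dtype=tf.int64)   # integer class ids (made up)

# Training loss: feed the raw logits, never the output of tf.nn.softmax
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Evaluation: softmax / argmax for predictions and accuracy
probs = tf.nn.softmax(logits)
predictions = tf.argmax(probs, axis=1)
accuracy = tf.reduce_mean(tf.cast(tf.equal(predictions, labels), tf.float32))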