Why does TensorFlow use a 'dim' parameter for the softmax function? What kind of tensors can we use as input?

tf.nn.softmax accepts any non-empty tensor as input.
You can apply softmax along whichever dimension you choose; the 'dim' parameter (named axis in later releases) selects it.
By default, softmax is applied to the last dimension of the input tensor, because it is usually applied to the output of a neural network, which typically has shape [batch_size, num_classes].
However, you could also take a tensor of shape [batch_size, num_classes, 2, 1] and compute the softmax over the second dimension only: tf.nn.softmax(tensor, axis=1)
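To make the effect of the axis choice concrete, here is a minimal sketch that uses NumPy as a stand-in for tf.nn.softmax (the helper `softmax` below is an illustration, not TensorFlow's implementation). It normalizes a [batch_size, num_classes, 2, 1] array over the second dimension and checks that the values along that dimension sum to 1.

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax along `axis`, shifted by the max for numerical stability."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch_size, num_classes = 4, 3
tensor = rng.normal(size=(batch_size, num_classes, 2, 1))

# Normalize over the class dimension (axis=1), not the last one.
probs = softmax(tensor, axis=1)

# Summing over axis 1 collapses the class dimension: shape (4, 2, 1).
sums_over_classes = probs.sum(axis=1)
```

Every entry of `sums_over_classes` should be 1 (up to floating-point error), while sums over any other axis generally are not.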


Data format and actual shape

I'm trying to migrate TensorFlow checkpoint weights to PyTorch.
When I extract some weights with cp.load_variable(<CKPT>, <FIELD_NAME>), I get a 4D list that appears to be ordered as HWCN, for example [1, 1, 512, 1024].
However, data_format is set to NHWC for all convolution blocks.
So why is there a mismatch?
Which one should I believe? Is the 4D list from cp.load_variable correct, so that all that is left to do is permute the dimensions?
The weights are not given as HWCN: weights have no batch dimension (N), since that would mean applying a different weight to each sample in the batch. The shape is [kernel_height, kernel_width, in_channels, out_channels]. There is no mismatch, because data_format only specifies the layout of the layer's input and output, not of its weights.
In PyTorch the weight of a convolution is stored as [out_channels, in_channels, kernel_height, kernel_width], so you only need to permute the dimensions.
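The permutation can be sketched with NumPy, assuming the loaded kernel is (or can be converted to) an array. The array `tf_kernel` below is a small hypothetical stand-in for a real checkpoint variable; the axis order (3, 2, 0, 1) moves out_channels first, then in_channels, then the two kernel dimensions.

```python
import numpy as np

# Hypothetical stand-in for a TensorFlow conv kernel with layout
# [kernel_height, kernel_width, in_channels, out_channels].
tf_kernel = np.arange(1 * 1 * 4 * 6, dtype=np.float32).reshape(1, 1, 4, 6)

# Reorder to PyTorch's [out_channels, in_channels, kernel_height, kernel_width].
torch_kernel = np.transpose(tf_kernel, (3, 2, 0, 1))

print(tf_kernel.shape, "->", torch_kernel.shape)
```

For the [1, 1, 512, 1024] example from the question, the same transpose would give [1024, 512, 1, 1]. On a PyTorch tensor the equivalent call would be `tensor.permute(3, 2, 0, 1)`.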

Explanation of an implementation of the categorical_crossentropy

The formula for the categorical cross-entropy is the following:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j} \mathbf{1}[y_i = j] \, \log p_{i,j}$$
What should the output of the last layer be? Should it be the probabilities of classes from a softmax layer?
What is the target?
How does the following code implement 1/N, the summation and p_{i,j}?
def categorical_crossentropy(output, target, from_logits=False):
    """Categorical crossentropy between an output tensor and a target tensor.

    # Arguments
        output: A tensor resulting from a softmax
            (unless `from_logits` is True, in which
            case `output` is expected to be the logits).
        target: A tensor of the same shape as `output`.
        from_logits: Boolean, whether `output` is the
            result of a softmax, or is a tensor of logits.

    # Returns
        Output tensor.
    """
    # Note: tf.nn.softmax_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        # scale preds so that the class probas of each sample sum to 1
        output /= tf.reduce_sum(output,
                                reduction_indices=len(output.get_shape()) - 1,
                                keep_dims=True)
        # manual computation of crossentropy
        epsilon = _to_tensor(_EPSILON, output.dtype.base_dtype)
        output = tf.clip_by_value(output, epsilon, 1. - epsilon)
        return - tf.reduce_sum(target * tf.log(output),
                               reduction_indices=len(output.get_shape()) - 1)
    else:
        return tf.nn.softmax_cross_entropy_with_logits(labels=target,
                                                       logits=output)
What should the output of the last layer be? Should it be the probabilities of classes from a softmax layer?
It can be either the output of the softmax layer or the raw logits (the input to the softmax layer). The output vector of the softmax layer contains the probabilities of each class. If output is the output of a softmax, set from_logits=False. If output contains the logits, set from_logits=True. You can see that in the latter case tf.nn.softmax_cross_entropy_with_logits is called internally, which computes the softmax probabilities and the cross-entropy at the same time. Computing them together allows for some math tricks that improve numerical stability.
What is the target?
The target is a one-hot vector. This means that a number n is represented by a vector v where v[n] = 1 and every other entry is 0. Here n is the class of the label. TensorFlow has a function for this encoding called tf.one_hot. Indices are zero-based, so, for example, tf.one_hot([3], 5) results in [[0, 0, 0, 1, 0]].
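As an illustration of the encoding, here is a small NumPy sketch. The helper `one_hot` is a hypothetical stand-in that mimics what tf.one_hot produces for a list of class indices; note that indices are zero-based, so class 3 sets the fourth entry.

```python
import numpy as np

def one_hot(indices, depth):
    """Return a (len(indices), depth) array with a 1 at each index, 0 elsewhere."""
    return np.eye(depth, dtype=np.float32)[np.asarray(indices)]

encoded = one_hot([3], 5)
print(encoded)  # [[0. 0. 0. 1. 0.]]
```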
How does the following code implement 1/N, the summation and p_{i,j}?
The code above does not average over all the inputs (there is no "1/N"). For example, if the input is shaped [10, 5], the output is shaped [10]. You would have to call tf.reduce_mean on the result to get the average. So, for each sample i, the equation is essentially:
$$L_i = -\sum_{j} \mathbf{1}[y_i = j] \, \log p_{i,j}$$
The above equation is implemented in the line
return - tf.reduce_sum(target * tf.log(output),
                       reduction_indices=len(output.get_shape()) - 1)
The "Σ" is tf.reduce_sum, "p_{i,j}" is output, and the indicator function (i.e. the bolded 1) is the one-hot encoded target.
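The mapping from the formula to the code can be checked with a small NumPy sketch that mirrors the from_logits=False branch. The value of `EPSILON` and the sample arrays are assumptions chosen for illustration; the function returns one loss per sample, and the "1/N" only appears once the mean is taken.

```python
import numpy as np

EPSILON = 1e-7  # assumed value, playing the role of _EPSILON in the Keras code

def categorical_crossentropy(output, target):
    """Per-sample cross-entropy from probabilities (no averaging)."""
    # Rescale so each row sums to 1, then clip to avoid log(0).
    output = output / output.sum(axis=-1, keepdims=True)
    output = np.clip(output, EPSILON, 1.0 - EPSILON)
    # The one-hot target plays the role of the indicator function.
    return -np.sum(target * np.log(output), axis=-1)

output = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
target = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

per_sample = categorical_crossentropy(output, target)  # shape (2,)
mean_loss = per_sample.mean()                          # this is the 1/N step
```

Here `per_sample` equals [-log(0.7), -log(0.8)], since the one-hot target selects exactly one log-probability per row.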
Side Note
You should use tf.nn.softmax_cross_entropy_with_logits_v2, because the code you provided (with from_logits=False) could result in numerical errors. The combined function takes care of those numerical issues.
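One way to see why the combined computation is safer is the following NumPy sketch. It is not TensorFlow's kernel, only an illustration of the log-sum-exp idea: the cross-entropy is computed from a shifted log-softmax, so no intermediate probability ever overflows. The extreme logits are an assumption chosen to trigger the problem.

```python
import numpy as np

def softmax_cross_entropy_with_logits(logits, labels):
    """Cross-entropy from logits via a shifted log-softmax (illustrative sketch)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -np.sum(labels * log_softmax, axis=-1)

logits = np.array([[1000.0, 0.0, -1000.0]])
labels = np.array([[0.0, 1.0, 0.0]])

# Naive softmax: exp(1000) overflows to inf, and inf / inf is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive_probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

stable_loss = softmax_cross_entropy_with_logits(logits, labels)
```

The naive probabilities contain nan, while the shifted version returns a finite loss (1000 for this input).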

Keras dense layer outputs are 'nan'

I'm using Keras to build a RNN model with CTC loss.
I found that when I pass a tensor to a Dense layer with activation=None, the outputs of this layer are all nan.
But when I set activation='softmax', the outputs are normal, not nan.
Problem code (elements of logits are all nan):
# x_permute is a tensor with shape (?, 1876, 96)
logits = Dense(out_shape, activation=None, name="logits")(x_permute)
loss_ctc = Lambda(ctc_lambda_func, name='ctc_my')(
    [logits, labels, x_len, lab_len])
model = Model(inputs=[x, labels, x_len, lab_len], outputs=[loss_ctc])
model.compile(loss={'ctc_my': lambda y_true, y_pred: y_pred}, optimizer='adadelta')
Normal code (elements of logits are not nan):
# x_permute is a tensor with shape (?, 1876, 96)
logits = Dense(out_shape, activation=None, name="logits")(x_permute)
output = Activation(activation="softmax", name="softmax")(logits)
loss_ctc = Lambda(ctc_lambda_func, name='ctc_my')(
    [output, labels, x_len, lab_len])
model = Model(inputs=[x, labels, x_len, lab_len], outputs=[loss_ctc])
model.compile(loss={'ctc_my': lambda y_true, y_pred: y_pred}, optimizer='adadelta')

def ctc_lambda_func(args):
    y_pred, y_true, input_length, label_length = args
    return ctc_batch_cost(y_true, y_pred, input_length, label_length)
Can anyone help? Many thanks.
I may have misunderstood you, but why would you want activation=None?
Maybe what you want is a linear activation?
Have a look at the Keras activation functions.
As per Klemen Grm:
your neural network is completely linear. You might consider different activation functions (e.g. tanh, sigmoid, linear) for your hidden and output layers. This both lets you constrain the output range, and will probably improve the learning properties of your network.
In addition to what Klemen says, for the last layer you want a softmax, which normalizes the outputs into probabilities.
Neural networks have to implement complex mapping functions, so they need non-linear activation functions to bring in the non-linearity that enables them to approximate any function. A neuron without an activation function is equivalent to a neuron with a linear activation function.
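A likely mechanism for the nan values can be sketched in NumPy, under the assumption that the loss takes the logarithm of y_pred (Keras documents the y_pred argument of ctc_batch_cost as the output of the softmax). The `logits` array below is hypothetical: raw Dense outputs can be negative, and the log of a negative number is nan, whereas softmax outputs are always positive.

```python
import numpy as np

logits = np.array([[2.0, -1.0, 0.5]])  # hypothetical raw Dense outputs

# Taking the log of raw values: the negative entry produces nan.
with np.errstate(invalid="ignore"):
    log_of_logits = np.log(logits)

# Taking the log after a softmax: every entry is positive, so the log is finite.
exps = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = exps / exps.sum(axis=-1, keepdims=True)
log_of_probs = np.log(probs)
```

Once a nan enters the loss, it propagates through the gradients into the weights, which would be consistent with all layer outputs turning into nan during training.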

Using binary_crossentropy loss in Keras (Tensorflow backend)

In the training example in the Keras documentation,
binary_crossentropy is used and a sigmoid activation is added in the network's last layer, but is it necessary to add a sigmoid in the last layer? As I found in the source code:
def binary_crossentropy(output, target, from_logits=False):
    """Binary crossentropy between an output tensor and a target tensor.

    Arguments:
        output: A tensor.
        target: A tensor with the same shape as `output`.
        from_logits: Whether `output` is expected to be a logits tensor.
            By default, we consider that `output`
            encodes a probability distribution.

    Returns:
        A tensor.
    """
    # Note: nn.softmax_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        # transform back to logits
        epsilon = _to_tensor(_EPSILON, output.dtype.base_dtype)
        output = clip_ops.clip_by_value(output, epsilon, 1 - epsilon)
        output = math_ops.log(output / (1 - output))
    return nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)
Keras invokes sigmoid_cross_entropy_with_logits in TensorFlow, but inside sigmoid_cross_entropy_with_logits, sigmoid(logits) is calculated again.
So I don't think it makes sense to add a sigmoid at the end, yet seemingly all the binary/multi-label classification examples and tutorials for Keras that I found online add a sigmoid at the end. Besides, I don't understand the meaning of
# Note: nn.softmax_cross_entropy_with_logits
# expects logits, Keras expects probabilities.
Why does Keras expect probabilities? Doesn't it use the nn.softmax_cross_entropy_with_logits function? Does that make sense?
You're right, that's exactly what's happening. I believe this is due to historical reasons.
Keras was created before TensorFlow, as a wrapper around Theano. In Theano, one has to compute the sigmoid/softmax manually and then apply the cross-entropy loss function. TensorFlow does everything in one fused op, but the API with a sigmoid/softmax layer had already been adopted by the community.
If you want to avoid unnecessary logit <-> probability conversions, call the binary_crossentropy loss with from_logits=True and don't add the sigmoid layer.
In categorical cross-entropy:
if the output is a prediction (probabilities), it computes the cross-entropy directly;
if the output is logits, it applies softmax cross-entropy with logits.
In binary cross-entropy:
if the output is a prediction (probabilities), it converts it back to logits and then applies sigmoid cross-entropy with logits;
if the output is logits, it applies sigmoid cross-entropy with logits directly.
In Keras, by default we use a sigmoid activation on the output layer and then use the Keras binary_crossentropy loss function, independent of the backend implementation (Theano, TensorFlow or CNTK).
If you look more closely at the pure TensorFlow case, you find that the TensorFlow backend binary_crossentropy function (which you pasted in your question) uses tf.nn.sigmoid_cross_entropy_with_logits. The latter function also applies the sigmoid activation. To avoid a double sigmoid, the TensorFlow backend binary_crossentropy will by default (with from_logits=False) calculate the inverse sigmoid (logit(x) = log(x / (1 - x))) to get the output back into the raw state from the network, with no activation.
The extra sigmoid activation and inverse sigmoid calculation can be avoided by using no sigmoid activation in your last layer and then calling the TensorFlow backend binary_crossentropy with from_logits=True (or directly using tf.nn.sigmoid_cross_entropy_with_logits).
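The round trip described above can be checked numerically. The following NumPy sketch uses hypothetical helper names and sample values; the logits-based loss uses the stable form max(x, 0) - x * z + log(1 + exp(-|x|)) given in the TensorFlow documentation for sigmoid_cross_entropy_with_logits. It shows that the inverse sigmoid recovers the raw network output and that both routes give the same loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Inverse sigmoid: log(p / (1 - p))."""
    return np.log(p / (1.0 - p))

def sigmoid_cross_entropy_with_logits(labels, logits):
    # Stable form: max(x, 0) - x * z + log(1 + exp(-|x|))
    return (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

def binary_crossentropy_from_probs(labels, probs):
    return -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))

z = np.array([-2.0, 0.0, 3.0])   # raw outputs of the last layer (no activation)
y = np.array([0.0, 1.0, 1.0])    # binary targets
p = sigmoid(z)                   # what a sigmoid output layer would produce

recovered = logit(p)             # the "transform back to logits" step
loss_from_logits = sigmoid_cross_entropy_with_logits(y, z)
loss_from_probs = binary_crossentropy_from_probs(y, p)
```

Since `recovered` matches `z`, applying the sigmoid in the model and undoing it in the loss is redundant work, which is what from_logits=True avoids.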

Per pixel softmax for fully convolutional network

I'm trying to implement something like a fully convolutional network, where the last convolution layer uses filter size 1x1 and outputs a 'score' tensor. The score tensor has shape [Batch, height, width, num_classes].
My question is: what function in TensorFlow can apply the softmax operation to each pixel, independently of the other pixels? The tf.nn.softmax op does not seem to be meant for this purpose.
If there is no such op available, I guess I have to write one myself.
UPDATE: if I do have to implement it myself, I think I may need to reshape the input tensor to [N, num_classes] where N = Batch x width x height, apply tf.nn.softmax, and then reshape it back. Does that make sense?
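Reshaping it to 2D and then reshaping it back, like you guessed, is the right approach. Note that later TensorFlow releases let tf.nn.softmax take tensors of any rank and normalize over the last dimension by default, in which case it can be applied to the score tensor directly.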
You can use this function.
I found it by searching GitHub.
import tensorflow as tf

# Multi dimensional softmax,
# refer to https://github.com/tensorflow/tensorflow/issues/210
# compute softmax along the dimension of target
# the native softmax only supports batch_size x dimension
def softmax(target, axis, name=None):
    with tf.name_scope(name, 'softmax', values=[target]):
        # subtract the max along `axis` for numerical stability
        max_axis = tf.reduce_max(target, axis, keep_dims=True)
        target_exp = tf.exp(target - max_axis)
        normalize = tf.reduce_sum(target_exp, axis, keep_dims=True)
        softmax = target_exp / normalize
        return softmax
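The two approaches from this thread (reshape to 2D and back, or normalize along the class axis directly) can be compared in a short NumPy sketch. NumPy stands in for TensorFlow here, and the shapes and the helper `softmax_last_axis` are assumptions for illustration only.

```python
import numpy as np

def softmax_last_axis(x):
    """Softmax over the last axis, shifted by the max for stability."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

batch, height, width, num_classes = 2, 3, 4, 5
scores = np.random.default_rng(0).normal(size=(batch, height, width, num_classes))

# Approach 1: flatten all pixels to rows, apply softmax, reshape back.
flat = scores.reshape(-1, num_classes)            # [N, num_classes]
per_pixel_via_reshape = softmax_last_axis(flat).reshape(scores.shape)

# Approach 2: normalize along the class axis of the 4D tensor directly.
per_pixel_direct = softmax_last_axis(scores)
```

Both results agree, and each pixel's class scores sum to 1 independently of every other pixel.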