In the custom Estimator, the output layer doesn't have an activation:
logits = tf.layers.dense(net, params['n_classes'], activation=None)
and then sparse_softmax_cross_entropy is used to calculate the loss:
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
Questions
In general, should the output layer also have an activation function?
Does sparse_softmax_cross_entropy mean that softmax is used as the activation function of the output layer when calculating the loss?
Computing the softmax and the cross entropy based on it "naively" can be numerically unstable. This is why it is recommended not to have an activation in your output layer (usually it would be tf.nn.softmax for classification). Instead, Tensorflow supplies loss functions such as sparse_softmax_cross_entropy which apply the softmax internally (in a numerically stable fashion) and then compute the cross entropy based on that. That is, you are supposed to supply model outputs without your own softmax (commonly called logits).
E.g. in the API docs for these cross-entropy ops you can usually find passages such as:
WARNING: This op expects unscaled logits, since it performs a softmax on logits internally for efficiency. Do not call this op with the output of softmax, as it will produce incorrect results.
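For example, a minimal sketch of this pattern with the same TF 1.x-style APIs as in the question (net, params and labels are placeholders for whatever your model_fn provides):
# Hedged sketch, assuming a TF 1.x custom Estimator model_fn.
logits = tf.layers.dense(net, params['n_classes'], activation=None)  # raw logits, no softmax
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

# Softmax is still fine for predictions -- just not before the loss:
probabilities = tf.nn.softmax(logits)
predicted_class = tf.argmax(logits, axis=1)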
Related
I am using an LSTM for binary classification and initially tried a model with 1 unit in the output (Dense) layer with sigmoid as the activation function.
However, it didn't perform well, and I saw a few notebooks where they used 2 units in the output layer (the layer immediately after the LSTM) with softmax as the activation function. Is there any advantage to using 2 output units with softmax instead of a single unit with sigmoid (for the purpose of binary classification)? I am using binary_crossentropy as the loss function.
Softmax may work better than sigmoid here because the derivative of the sigmoid saturates towards zero for large positive or negative inputs (the vanishing gradient problem), which makes training harder. That might be the reason softmax performs better than sigmoid in your case.
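For illustration, a rough tf.keras sketch of the two setups (shapes and layer sizes are made up; note that a 2-unit softmax output is normally paired with categorical crossentropy rather than binary_crossentropy):
from tensorflow.keras import layers, models

timesteps, n_features = 100, 8  # placeholder input shape

# Option A: 1 unit + sigmoid, paired with binary_crossentropy
model_a = models.Sequential([
    layers.LSTM(64, input_shape=(timesteps, n_features)),
    layers.Dense(1, activation='sigmoid'),
])
model_a.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Option B: 2 units + softmax, paired with (sparse_)categorical_crossentropy
model_b = models.Sequential([
    layers.LSTM(64, input_shape=(timesteps, n_features)),
    layers.Dense(2, activation='softmax'),
])
model_b.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])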
The answer to the question in the header is potentially extremely obvious, given it is commonly referred to as "ArcFace Loss".
However, one part is confusing me:
I was reading through the following Keras implementation of Arcface loss:
https://github.com/4uiiurz1/keras-arcface
In it, note that the model.compile line still specifies loss='categorical_crossentropy'
Further, I see a lot of sources referring to Softmax as a loss function, which I had previously understood to instead be the activation function of the output layer for many classification neural networks.
Based on these two points of confusion, my current understanding is that the loss function, i.e. how the network actually calculates the number which represents the "magnitude of wrongness" for a given example, is cross entropy regardless, and that ArcFace, like Softmax, is instead the activation function for the output layer.
Would this be correct? If so, why are Arcface and Softmax referred to as loss functions? If not, where might my confusion be coming from?
Based on my understanding, the two things that you are confused about are as follows -
Is ArcFace a loss or an activation function?
Is softmax a loss or an activation function?
Is ArcFace a loss or an activation function
Your assumption that ArcFace is an activation function is incorrect.
ArcFace is indeed a loss function.
If you go through the research paper, the authors have mentioned that they use the traditional softmax function as an activation function for the last layer.
(You can check out the call function in the metrics.py file; the last line is out = tf.nn.softmax(logits).)
This means that after applying the additive angular margin penalty, they pass the resulting logits to the softmax function.
It might sound confusing: if ArcFace itself is a loss function, then why is it using softmax? The answer is pretty simple: just to get the probabilities of the classes.
So basically what they have done is apply the additive angular margin penalty, then pass the obtained logits to the softmax to get the class probabilities, and apply categorical cross entropy loss on top of that.
To better understand the workflow, check out the image below:
[ArcFace workflow diagram]
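A simplified sketch of that flow (illustrative only, not the exact code of the linked repository; arcface_logits and its arguments are made-up names, with s and m being the scale and margin hyperparameters from the paper):
import tensorflow as tf

def arcface_logits(embeddings, class_weights, labels_onehot, s=30.0, m=0.50):
    # cosine similarity between L2-normalized embeddings and class weights
    x = tf.nn.l2_normalize(embeddings, axis=1)     # (batch, emb_dim)
    W = tf.nn.l2_normalize(class_weights, axis=0)  # (emb_dim, n_classes)
    cos_theta = tf.matmul(x, W)                    # (batch, n_classes)
    theta = tf.acos(tf.clip_by_value(cos_theta, -1.0 + 1e-7, 1.0 - 1e-7))
    # additive angular margin penalty, applied only to the true class
    logits = s * tf.where(labels_onehot > 0, tf.cos(theta + m), cos_theta)
    return logits

# ...then plain softmax + categorical crossentropy on top:
# probs = tf.nn.softmax(arcface_logits(emb, W, y_onehot))
# loss  = tf.keras.losses.categorical_crossentropy(y_onehot, probs)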
I feel your confusion might come from the fact that most people refer to softmax as a loss function, although it is not really a loss. I have explained this in detail below.
Is Softmax a loss or an activation function
I feel that you are a bit confused between softmax and categorical crossentropy.
I will do my best to explain the differences between the two.
Softmax
Softmax is just a function, not a loss. It squishes its input values into the range (0, 1) and makes sure that they sum to 1, i.e. it has a nice probabilistic interpretation.
Softmax function: softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
Cross Entropy Loss
This is actually a loss function. The general form of Cross Entropy loss is as follows -
CE = -Σ_i t_i · log(s_i), where t_i are the target values and s_i the predicted probabilities
It has 2 variants -
Binary Cross Entropy Loss
Categorical Cross Entropy Loss
Binary Cross Entropy Loss
It is used for binary classification tasks.
BCE = -[t · log(s) + (1 - t) · log(1 - s)]
Categorical Cross Entropy Loss / Softmax Loss
CCE loss is also commonly called the softmax loss.
It is used for multi-class classification because of the probabilistic interpretation provided by the softmax function.
CCE = -Σ_i t_i · log(softmax(z)_i)
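To make the roles concrete, here is a minimal tf.keras sketch (layer sizes are arbitrary): softmax is the activation of the last layer, while categorical cross entropy is the loss computed on top of it.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(128, activation='relu', input_shape=(784,)),
    layers.Dense(10, activation='softmax'),       # activation: logits -> class probabilities
])
model.compile(loss='categorical_crossentropy',    # loss: compares probabilities with targets
              optimizer='adam',
              metrics=['accuracy'])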
I'm using Keras to build an RNN model with CTC loss.
I found that when I passed a tensor to a Dense layer with activation=None, the outputs of this layer were all NaN.
But when I set activation='softmax', the outputs were normal, not NaN.
Problem code (elements of logits are all NaN):
logits = Dense(out_shape, activation=None, name="logits")(x_permute)  # x_permute is a tensor with shape (?, 1876, 96)
loss_ctc = Lambda(ctc_lambda_func, name='ctc_my')(
    [logits, labels, x_len, lab_len])
model = Model(inputs=[x, labels, x_len, lab_len], outputs=[loss_ctc])
model.compile(loss={'ctc_my': lambda y_true, y_pred: y_pred}, optimizer='adadelta')
Normal code (elements of logits are not NaN):
logits = Dense(out_shape, activation=None, name="logits")(x_permute)  # x_permute is a tensor with shape (?, 1876, 96)
output = Activation(activation="softmax", name="softmax")(logits)
loss_ctc = Lambda(ctc_lambda_func, name='ctc_my')(
    [output, labels, x_len, lab_len])
model = Model(inputs=[x, labels, x_len, lab_len], outputs=[loss_ctc])
model.compile(loss={'ctc_my': lambda y_true, y_pred: y_pred}, optimizer='adadelta')
def ctc_lambda_func(args):
    y_pred, y_true, input_length, label_length = args
    return ctc_batch_cost(y_true, y_pred, input_length, label_length)
Can anyone help? Many thanks.
I may misunderstand you, but why would you want activation=None?
Maybe what you want is a linear activation?
Have a look at Keras Activation Functions
As per Klemen Grm:
your neural network is completely linear. You might consider different activation functions (eg: tanh, sigmoid, linear) for your hidden and output layers. This both lets you constrain the output range, and will probably improve the learning properties of your network.
In addition to what Klemen says, for the last layer you want a softmax, which normalizes the outputs into probabilities.
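As a quick illustration of that normalization (plain numpy, values made up):
import numpy as np

logits = np.array([2.0, 1.0, 0.1])
probs = np.exp(logits) / np.sum(np.exp(logits))
print(probs)        # approx. [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0 -- a valid probability distribution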
Neural networks have to implement complex mapping functions, hence they need non-linear activation functions to bring in the much-needed non-linearity that enables them to approximate any function. A neuron without an activation function is equivalent to a neuron with a linear activation function.
In the training example in Keras documentation,
https://keras.io/getting-started/sequential-model-guide/#training
binary_crossentropy is used and a sigmoid activation is added in the network's last layer, but is it necessary to add a sigmoid in the last layer? Here is what I found in the source code:
def binary_crossentropy(output, target, from_logits=False):
  """Binary crossentropy between an output tensor and a target tensor.

  Arguments:
      output: A tensor.
      target: A tensor with the same shape as `output`.
      from_logits: Whether `output` is expected to be a logits tensor.
          By default, we consider that `output`
          encodes a probability distribution.

  Returns:
      A tensor.
  """
  # Note: nn.softmax_cross_entropy_with_logits
  # expects logits, Keras expects probabilities.
  if not from_logits:
    # transform back to logits
    epsilon = _to_tensor(_EPSILON, output.dtype.base_dtype)
    output = clip_ops.clip_by_value(output, epsilon, 1 - epsilon)
    output = math_ops.log(output / (1 - output))
  return nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)
Keras invokes sigmoid_cross_entropy_with_logits in TensorFlow, but inside the sigmoid_cross_entropy_with_logits function, sigmoid(logits) is calculated again.
https://www.tensorflow.org/versions/master/api_docs/python/tf/nn/sigmoid_cross_entropy_with_logits
So I don't think it makes sense to add a sigmoid at the end, but seemingly all the binary/multi-label classification examples and tutorials in Keras that I found online add a sigmoid at the end. Besides, I don't understand the meaning of
# Note: nn.softmax_cross_entropy_with_logits
# expects logits, Keras expects probabilities.
Why does Keras expect probabilities? Doesn't it use the nn.softmax_cross_entropy_with_logits function? Does this make sense?
Thanks.
You're right, that's exactly what's happening. I believe this is due to historical reasons.
Keras was created before TensorFlow, as a wrapper around Theano. And in Theano, one has to compute the sigmoid/softmax manually and then apply the cross-entropy loss function. TensorFlow does everything in one fused op, but the API with a sigmoid/softmax layer was already adopted by the community.
If you want to avoid unnecessary logit <-> probability conversions, call the binary_crossentropy loss with from_logits=True and don't add the sigmoid layer.
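For example, with a recent tf.keras (where the loss object accepts from_logits) this would look roughly like the following; the layer sizes and input shape are made up:
from tensorflow.keras import layers, losses, models

model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(20,)),   # hypothetical hidden layer
    layers.Dense(1, activation=None),                         # raw logits, no sigmoid
])
model.compile(loss=losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam')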
In categorical cross entropy:
if it is a prediction (probabilities), it will compute the cross entropy directly
if it is a logit, it will apply softmax_cross_entropy_with_logits
In binary cross entropy:
if it is a prediction (probabilities), it will convert it back to logits and then apply sigmoid_cross_entropy_with_logits
if it is a logit, it will apply sigmoid_cross_entropy_with_logits directly
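A small sanity check of this behaviour, assuming TF 2.x eager execution and the tf.keras backend functions (values are made up):
import tensorflow.keras.backend as K

logits = K.constant([[2.0, -1.0, 0.3]])
target = K.constant([[1.0, 0.0, 0.0]])

# passing probabilities (default) vs. passing logits with from_logits=True
loss_from_probs  = K.categorical_crossentropy(target, K.softmax(logits))
loss_from_logits = K.categorical_crossentropy(target, logits, from_logits=True)
# both losses should be (numerically almost) identical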
In Keras, by default, we use a sigmoid activation on the output layer and then use the Keras binary_crossentropy loss function, independent of the backend implementation (Theano, TensorFlow or CNTK).
If you look more in depth at the pure TensorFlow case, you find that the TensorFlow backend binary_crossentropy function (which you pasted in your question) uses tf.nn.sigmoid_cross_entropy_with_logits. The latter function also applies the sigmoid activation. To avoid a double sigmoid, the TensorFlow backend binary_crossentropy will by default (with from_logits=False) calculate the inverse sigmoid (logit(x) = log(x / (1 - x))) to get the output back into the raw, pre-activation state of the network.
The extra sigmoid activation and inverse-sigmoid calculation can be avoided by using no sigmoid activation in your last layer and then calling the TensorFlow backend binary_crossentropy with from_logits=True (or directly using tf.nn.sigmoid_cross_entropy_with_logits).
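A rough TF 1.x-style sketch of the "direct" route (net and labels are placeholders; labels should be float 0/1 values with the same shape as logits):
logits = tf.layers.dense(net, 1, activation=None)   # last layer without sigmoid
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
probabilities = tf.nn.sigmoid(logits)               # only needed for predictions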
What might be the equivalent function of the following theano function in tensorflow?
theano.tensor.nnet.categorical_crossentropy(o, y)
I think you would want to use the softmax cross-entropy loss from TensorFlow. Remember that the input to this op is unscaled logits, i.e. you cannot feed it softmax output; that will give wrong results.
Another important reason to use this loss instead of a combination of softmax + categorical cross-entropy is that the softmax loss is more stable. See this loss in Caffe. Also for some discussion about stability, see this.
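Assuming o contains unscaled logits and y is one-hot, a rough TF 1.x-style equivalent would be something like:
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=o))
# with integer class labels, the sparse variant is the usual choice:
# loss = tf.reduce_mean(
#     tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_int, logits=o))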
For 2D tensors with probability distributions in the 2nd dimension:
def crossentropy(p_approx, p_true):
    return -tf.reduce_sum(tf.multiply(p_true, tf.log(p_approx)), 1)
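Example usage (values are illustrative; this assumes TF 1.x, where tf.log exists):
p_true = tf.constant([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
p_approx = tf.constant([[0.8, 0.1, 0.1],
                        [0.2, 0.7, 0.1]])
per_example_ce = crossentropy(p_approx, p_true)  # shape (2,), one cross entropy per row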