How can tf.nn.softmax_cross_entropy_with_logits compute softmax cross entropy in TensorFlow?

The documentation for tf.nn.softmax_cross_entropy_with_logits says that it computes softmax cross entropy between logits and labels. What does that mean? Is it not simply applying the cross-entropy loss formula? Why does the documentation say that it computes softmax cross entropy?

Also from the Docs:
Measures the probability error in discrete classification tasks in which the classes are mutually exclusive (each entry is in exactly one class).
Softmax classification uses the cross-entropy loss function to train and classify data among discrete classes. Other activation functions, such as ReLU (Rectified Linear Units) or sigmoid, are also used in linear classification and neural networks; in this case softmax is used.
Activation functions are decision functions (the ones that actually classify data into categories), and cross-entropy is the function used to calculate the error during training (you could calculate the error cost in other ways, such as mean squared error). However, cross-entropy currently seems to be the best way to calculate it.
As some point out, softmax cross-entropy is a commonly used term in Classification for convenient notation.
Edit
Regarding the logits: it means that the function works with unscaled input. In other words, the inputs need not be probabilities (values may be negative or greater than 1), because the function applies softmax to them internally. Check this question to know more about softmax_cross_entropy_with_logits and its components.
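To make the answer above concrete, here is a minimal pure-Python sketch of what "softmax cross entropy with logits" computes: softmax turns the raw scores into probabilities, and cross-entropy compares them with the labels. The function names and the example values are illustrative only; this is not TensorFlow's implementation.

```python
import math

def softmax(logits):
    # Subtracting the max does not change the result but avoids overflow.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(labels, probs):
    # labels is a probability distribution over classes (e.g. one-hot).
    return -sum(t * math.log(p) for t, p in zip(labels, probs) if t > 0)

def softmax_cross_entropy_with_logits(labels, logits):
    # Illustrative stand-in: softmax first, then cross-entropy.
    return cross_entropy(labels, softmax(logits))

logits = [2.0, 1.0, 0.1]   # unscaled scores, not probabilities
labels = [1.0, 0.0, 0.0]   # one-hot: the true class is class 0
probs = softmax(logits)
loss = softmax_cross_entropy_with_logits(labels, logits)
print(probs, loss)
```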

Related

Is ArcFace strictly a loss function or an activation function?

The answer to the question in the header is potentially extremely obvious, given it is commonly referred to as "ArcFace Loss".
However, one part is confusing me:
I was reading through the following Keras implementation of Arcface loss:
https://github.com/4uiiurz1/keras-arcface
In it, note that the model.compile line still specifies loss='categorical_crossentropy'
Further, I see a lot of sources referring to Softmax as a loss function, which I had previously understood to instead be the activation function of the output layer for many classification neural networks.
Based on these two points of confusion, my current understanding is that the loss function, i.e. how the network actually calculates the number which represents the "magnitude of wrongness" for a given example, is cross entropy regardless, and that ArcFace, like Softmax, is instead the activation function for the output layer.
Would this be correct? If so, why are Arcface and Softmax referred to as loss functions? If not, where might my confusion be coming from?
Based on my understanding, the two things that you are confused about are as follows:
Is ArcFace a loss or an activation function?
Is softmax a loss or an activation function?
Is ArcFace a loss or an activation function?
Your assumption that ArcFace is an activation function is incorrect.
ArcFace is indeed a loss function.
If you go through the research paper, the authors have mentioned that they use the traditional softmax function as an activation function for the last layer.
(You can check the call function in the metrics.py file. The last line is
out = tf.nn.softmax(logits).)
It means that after applying the additive angular margin penalty, they simply pass the logits to the softmax function.
It might sound confusing: if ArcFace itself is a loss function, why is it using softmax? The answer is simple: just to get the probabilities of the classes.
So basically what they have done is that they have applied the
additive angular margin penalty, then passed the obtained logits to the
softmax to get the class probabilities and applied categorical cross
entropy loss on top of that.
To better understand the workflow, see the figure below.
[Figure: ArcFace workflow]
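The three steps described above (margin penalty, softmax, categorical cross-entropy) can be sketched in pure Python. This is a simplified illustration, not the linked repository's code: the scale s=30.0 and margin m=0.5 are assumed example values, the cosines are made up, and real implementations also handle the case where theta + m exceeds pi.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def arcface_logits(cosines, target, s=30.0, m=0.5):
    # cosines[j] is cos(theta_j) between the normalized embedding and the
    # normalized weight vector of class j. The margin m is added to the
    # angle of the target class only; every logit is then scaled by s.
    out = []
    for j, c in enumerate(cosines):
        if j == target:
            theta = math.acos(max(-1.0, min(1.0, c)))
            out.append(s * math.cos(theta + m))
        else:
            out.append(s * c)
    return out

def categorical_cross_entropy(target, probs):
    return -math.log(probs[target])

cosines = [0.8, 0.3, -0.1]   # hypothetical cosine similarities
target = 0

# Without the margin: plain scaled cosines -> softmax -> cross-entropy.
loss_plain = categorical_cross_entropy(target, softmax([30.0 * c for c in cosines]))
# With the margin: penalized logits -> softmax -> cross-entropy.
margin_logits = arcface_logits(cosines, target)
loss_arcface = categorical_cross_entropy(target, softmax(margin_logits))
print(loss_plain, loss_arcface)
```

The margin lowers the target logit, so the same example yields a larger loss; the loss that is actually minimized is still cross-entropy on softmax probabilities.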
I feel your confusion might be because of the fact that most people consider softmax to be a loss function, although it is not really a
loss. I have explained it in detail below.
Is Softmax a loss or an activation function
I feel that you are a bit confused between softmax and categorical crossentropy.
I will do my best to explain the differences between the two.
Softmax
Softmax is just a function and not a loss. It squishes the values between 0 and 1. It makes sure that the sum of all these values is equal to 1 i.e. it has a nice probabilistic interpretation.
Softmax function: softmax(z)_j = exp(z_j) / SUM_k exp(z_k)
Cross Entropy Loss
This is actually a loss function. The general form of Cross Entropy loss is as follows -
Cross-entropy loss: CE = - SUM_i t_i log(p_i), where t is the target distribution and p is the predicted distribution
It has 2 variants -
Binary Cross Entropy Loss
Categorical Cross Entropy Loss
Binary Cross Entropy Loss
It is used for binary classification tasks.
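Binary cross-entropy loss: BCE = - [ t log(p) + (1 - t) log(1 - p) ], where t is the 0/1 target and p is the predicted probability of the positive class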
Categorical Cross Entropy Loss / Softmax Loss
CCE loss is often called the softmax loss, because it is usually applied to the output of a softmax.
It is used for multi-class classification because of the probabilistic interpretation provided by the softmax function.
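Categorical cross-entropy loss: CCE = - SUM_i t_i log(p_i), where p = softmax(z) are the predicted class probabilities and t is the one-hot target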
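As a small check of how the two variants relate, the sketch below (illustrative names, pure Python) shows that binary cross-entropy is the two-class special case of categorical cross-entropy.

```python
import math

def binary_cross_entropy(t, p):
    # t is the 0/1 target, p the predicted probability of the positive class.
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

def categorical_cross_entropy(targets, probs):
    # targets is a one-hot (or soft) distribution over the classes.
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0)

p = 0.8                                   # predicted probability of class 1
bce = binary_cross_entropy(1, p)
cce = categorical_cross_entropy([0, 1], [1 - p, p])
print(bce, cce)
```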

How to use tensorflow pairwise loss

In TensorFlow, there is a pairwise mean squared error function that takes in "predictions", but it is not documented whether this should be a sigmoid/softmax output or logits. https://www.tensorflow.org/api_docs/python/tf/losses/mean_pairwise_squared_error
I am looking to see if predictions must be a certain form for the input, or if there is a better pairwise loss function available.
The logits layer, in the deep learning context, is the layer on which the softmax function is applied. The softmax function is applied when we want to perform multi-class classification. When we want to perform classification, the most common error measure is cross-entropy. On the other hand, the mean pairwise squared error is used in the context of regression. When we perform regression, we want to predict a real value, as opposed to classification, where we want to predict a class. With that said, the layer that generates the outputs won't be a logits layer, but an ordinary linear layer. Moreover, the most common error measure for regression is mean squared error.
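To illustrate why this loss belongs to regression on real values, here is a rough pure-Python sketch of a pairwise squared error for a single sample. It follows the three-element example given in the TensorFlow docstring (differences between pairs of labels compared with differences between the same pairs of predictions, averaged over the pairs); the library's exact normalization and batching may differ, so treat this as an assumption rather than the real implementation.

```python
from itertools import combinations

def mean_pairwise_squared_error(labels, predictions):
    # For every pair (i, j), compare the label difference with the
    # prediction difference, then average over all pairs.
    pairs = list(combinations(range(len(labels)), 2))
    total = sum(
        ((labels[i] - labels[j]) - (predictions[i] - predictions[j])) ** 2
        for i, j in pairs
    )
    return total / len(pairs)

labels = [1.0, 2.0, 4.0]
shifted = [x + 5.0 for x in labels]   # same differences, constant offset
loss_shifted = mean_pairwise_squared_error(labels, shifted)
loss_other = mean_pairwise_squared_error(labels, [1.0, 2.0, 3.0])
print(loss_shifted, loss_other)
```

Only differences between elements matter, so a constant offset in the predictions is not penalized. The inputs are plain real values, not probabilities.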

What are the differences between all these cross-entropy losses in Keras and TensorFlow?

What are the differences between all these cross-entropy losses?
Keras is talking about
Binary cross-entropy
Categorical cross-entropy
Sparse categorical cross-entropy
While TensorFlow has
Softmax cross-entropy with logits
Sparse softmax cross-entropy with logits
Sigmoid cross-entropy with logits
What are the differences and relationships between them? What are the typical applications for them? What's the mathematical background? Are there other cross-entropy types that one should know? Are there any cross-entropy types without logits?
There is just one cross (Shannon) entropy defined as:
H(P||Q) = - SUM_i P(X=i) log Q(X=i)
In machine learning usage, P is the actual (ground truth) distribution, and Q is the predicted distribution. All the functions you listed are just helper functions which accept different ways to represent P and Q.
There are basically 3 main things to consider:
there are either 2 possible outcomes (binary classification) or more. If there are just two outcomes, then Q(X=1) = 1 - Q(X=0), so a single float in (0,1) identifies the whole distribution; this is why a neural network for binary classification has a single output (and so does logistic regression). If there are K>2 possible outcomes, one has to define K outputs (one per each Q(X=...))
one either produces proper probabilities (meaning that Q(X=i)>=0 and SUM_i Q(X=i) = 1), or one just produces a "score" and has some fixed method of transforming the score to a probability. For example, a single real number can be "transformed to probability" by taking its sigmoid, and a set of real numbers can be transformed by taking their softmax, and so on.
there is j such that P(X=j)=1 (there is one "true class", targets are "hard", like "this image represents a cat") or there are "soft targets" (like "we are 60% sure this is a cat, but with 40% it is actually a dog").
Depending on these three aspects, different helper function should be used:
                                  outcomes   what is in Q   targets in P
-------------------------------------------------------------------------------
binary CE                                2    probability            any
categorical CE                          >2    probability           soft
sparse categorical CE                   >2    probability           hard
sigmoid CE with logits                   2          score            any
softmax CE with logits                  >2          score           soft
sparse softmax CE with logits           >2          score           hard
In the end one could just use "categorical cross entropy", as this is how it is mathematically defined, however since things like hard targets or binary classification are very popular - modern ML libraries do provide these additional helper functions to make things simpler. In particular "stacking" sigmoid and cross entropy might be numerically unstable, but if one knows these two operations are applied together - there is a numerically stable version of them combined (which is implemented in TF).
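The numerical-stability remark can be illustrated in pure Python. The stable form below, max(x, 0) - x*z + log(1 + exp(-|x|)), is the one given in the TensorFlow documentation for sigmoid cross-entropy with logits; the function names and test values are illustrative.

```python
import math

def naive_sigmoid_ce(logit, target):
    # "Stacked" version: sigmoid first, then cross-entropy.
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def stable_sigmoid_ce(logit, target):
    # Combined version: never forms log(0), never exponentiates a large number.
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

logit, target = 40.0, 0.0
try:
    naive_loss = naive_sigmoid_ce(logit, target)
except ValueError:
    # The sigmoid rounded to exactly 1.0, so log(1 - p) became log(0).
    naive_loss = None
stable_loss = stable_sigmoid_ce(logit, target)
print(naive_loss, stable_loss)
```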
It is important to notice that if you apply the wrong helper function, the code will usually still execute, but the results will be wrong. For example, if you apply a softmax_* helper for binary classification with one output, your network will be considered to always produce "True" at the output.
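The single-output failure mode is easy to reproduce in a small pure-Python sketch (illustrative code, not a TensorFlow call): a softmax over one value is always 1, so the loss carries no information about the logit.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# With a single output unit, softmax normalizes the value against itself.
single_output_probs = [softmax([x])[0] for x in (-5.0, 0.0, 7.0)]
# Cross-entropy against the only "class" is therefore always zero.
losses = [-math.log(p) for p in single_output_probs]
print(single_output_probs, losses)
```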
As a final note - this answer considers classification, it is slightly different when you consider multi label case (when a single point can have multiple labels), as then Ps do not sum to 1, and one should use sigmoid_cross_entropy_with_logits despite having multiple output units.
Logits
For this purpose, "logits" can be seen as the non-activated outputs of the model.
While Keras losses by default expect an "activated" output (you must apply "sigmoid" or "softmax" before the loss),
the TensorFlow functions take "logits", i.e. "non-activated" outputs (you should not apply "sigmoid" or "softmax" before the loss).
Losses "with logits" will apply the activation internally.
Some functions allow you to choose from_logits=True or from_logits=False, which tells the function whether or not to apply the activation itself.
Sparse
Sparse functions use the target data (ground truth) as "integer labels": 0, 1, 2, 3, 4.....
Non-sparse functions use the target data as "one-hot labels": [1,0,0], [0,1,0], [0,0,1]
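The two label formats describe the same target, so the losses agree. A minimal pure-Python sketch with illustrative function names and made-up probabilities:

```python
import math

def one_hot(index, num_classes):
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def categorical_ce(one_hot_label, probs):
    # Non-sparse: the target is a one-hot vector.
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs) if t > 0)

def sparse_categorical_ce(label_index, probs):
    # Sparse: the target is just the integer index of the true class.
    return -math.log(probs[label_index])

probs = [0.1, 0.7, 0.2]          # already-normalized predictions
dense_loss = categorical_ce(one_hot(1, 3), probs)
sparse_loss = sparse_categorical_ce(1, probs)
print(dense_loss, sparse_loss)
```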
Binary crossentropy = Sigmoid crossentropy
Problem type:
single class (false/true); or
non-exclusive multiclass (many classes may be correct)
Model output shape: (batch, ..., >=1)
Activation: "sigmoid"
Categorical crossentropy = Softmax crossentropy
Problem type: exclusive classes (only one class may be correct)
Model output shape: (batch, ..., >=2)
Activation: "softmax"

What is the meaning of the word logits in TensorFlow? [duplicate]

In the following TensorFlow function, we must feed the activation of artificial neurons in the final layer. That I understand. But I don't understand why it is called logits? Isn't that a mathematical function?
loss_function = tf.nn.softmax_cross_entropy_with_logits(
    logits=last_layer,
    labels=target_output
)
Logits is an overloaded term which can mean many different things:
In Math, Logit is a function that maps probabilities ([0, 1]) to R ((-inf, inf))
Probability of 0.5 corresponds to a logit of 0. Negative logit correspond to probabilities less than 0.5, positive to > 0.5.
In ML, it can be
the vector of raw (non-normalized) predictions that a classification
model generates, which is ordinarily then passed to a normalization
function. If the model is solving a multi-class classification
problem, logits typically become an input to the softmax function. The
softmax function then generates a vector of (normalized) probabilities
with one value for each possible class.
Logits also sometimes refer to the element-wise inverse of the sigmoid function.
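For the mathematical meaning, a short pure-Python sketch shows the logit as the log-odds and as the inverse of the sigmoid. The function names are illustrative.

```python
import math

def logit(p):
    # Log-odds of a probability p in (0, 1).
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

at_half = logit(0.5)            # probability 0.5 corresponds to logit 0
below = logit(0.2)              # probabilities below 0.5 give negative logits
above = logit(0.9)              # probabilities above 0.5 give positive logits
round_trip = sigmoid(logit(0.3))
print(at_half, below, above, round_trip)
```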
Just adding this clarification so that anyone who scrolls down this far can at least get it right, since there are so many wrong answers upvoted.
Diansheng's answer and JakeJ's answer get it right.
A new answer posted by Shital Shah is an even better and more complete answer.
Yes, logit is a mathematical function in statistics, but the logit used in the context of neural networks is different. The statistical logit doesn't even make sense here.
I couldn't find a formal definition anywhere, but logit basically means:
The raw predictions which come out of the last layer of the neural network.
1. This is the very tensor on which you apply the argmax function to get the predicted class.
2. This is the very tensor which you feed into the softmax function to get the probabilities for the predicted classes.
Also, from a tutorial on official tensorflow website:
Logits Layer
The final layer in our neural network is the logits layer, which will return the raw values for our predictions. We create a dense layer with 10 neurons (one for each target class 0–9), with linear activation (the default):
logits = tf.layers.dense(inputs=dropout, units=10)
If you are still confused, the situation is like this:
raw_predictions = neural_net(input_layer)
predicted_class_index_by_raw = argmax(raw_predictions)
probabilities = softmax(raw_predictions)
predicted_class_index_by_prob = argmax(probabilities)
where, predicted_class_index_by_raw and predicted_class_index_by_prob will be equal.
Another name for raw_predictions in the above code is logit.
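The claim that both argmax results agree can be checked with a runnable pure-Python version of the pseudocode above. The raw prediction values are made up for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

raw_predictions = [1.2, -0.7, 3.4, 0.0]    # hypothetical last-layer outputs
probabilities = softmax(raw_predictions)

predicted_class_index_by_raw = argmax(raw_predictions)
predicted_class_index_by_prob = argmax(probabilities)
print(predicted_class_index_by_raw, predicted_class_index_by_prob)
```

Softmax is monotonic in each input, so it preserves the ordering of the raw values and therefore the argmax.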
As for the why logit... I have no idea. Sorry.
[Edit: See this answer for the historical motivations behind the term.]
Trivia
Although, if you want to, you can apply statistical logit to probabilities that come out of the softmax function.
If the probability of a certain class is p,
Then the log-odds of that class is L = logit(p).
Also, the probability of that class can be recovered as p = sigmoid(L), using the sigmoid function.
Not very useful to calculate log-odds though.
Summary
In the context of deep learning, the logits layer means the layer that feeds into softmax (or another such normalization). The outputs of the softmax are the probabilities for the classification task, and its input is the logits layer. The logits layer typically produces values from -infinity to +infinity, and the softmax layer transforms them to values from 0 to 1.
Historical Context
Where does this term come from? In the 1930s and 40s, several people were trying to adapt linear regression to the problem of predicting probabilities. However, linear regression produces output from -infinity to +infinity, while for probabilities our desired output is 0 to 1. One way to do this is by somehow mapping the probabilities 0 to 1 to -infinity to +infinity and then using linear regression as usual. One such mapping is based on the cumulative normal distribution, which was used by Chester Ittner Bliss in 1934; he called this the "probit" model, short for "probability unit". However, this function is computationally expensive while lacking some of the desirable properties for multi-class classification. In 1944 Joseph Berkson used the function log(p/(1-p)) to do this mapping and called it logit, short for "logistic unit". The term logistic regression derived from this as well.
The Confusion
Unfortunately the term logits is abused in deep learning. From a purely mathematical perspective, logit is a function that performs the above mapping. In deep learning, people started calling the layer that feeds into the softmax the "logits layer". Then people started calling the output values of this layer "logits", creating confusion with logit the function.
TensorFlow Code
Unfortunately, TensorFlow code adds to the confusion with names like tf.nn.softmax_cross_entropy_with_logits. What does logits mean here? It just means the input of the function is supposed to be the output of the last neuron layer, as described above. The _with_logits suffix is redundant, confusing and pointless. Functions should be named without regard to such very specific contexts, because they are simply mathematical operations that can be performed on values derived from many other domains. In fact, TensorFlow has another similar function, sparse_softmax_cross_entropy, where they fortunately forgot to add the _with_logits suffix, creating inconsistency and adding to the confusion. PyTorch, on the other hand, simply names its functions without these kinds of suffixes.
Reference
The Logit/Probit lecture slides are one of the best resources for understanding logit. I have also updated the Wikipedia article with some of the above information.
Logit is a function that maps probabilities [0, 1] to [-inf, +inf].
Softmax is a function that maps [-inf, +inf] to [0, 1], similar to sigmoid. But softmax also normalizes the values (the output vector) so that they sum to 1.
TensorFlow "with logits": it means that a softmax function is applied to the logit numbers to normalize them. The input vector (the logits) is not normalized and can range over [-inf, inf].
This normalization is used for multiclass classification problems. And for multilabel classification problems sigmoid normalization is used i.e. tf.nn.sigmoid_cross_entropy_with_logits
My personal understanding is that, in the TensorFlow domain, logits are the values to be used as input to softmax. I came to this understanding based on this TensorFlow tutorial.
https://www.tensorflow.org/tutorials/layers
Although it is true that logit is a function in maths (especially in statistics), I don't think that's the same 'logit' you are looking at. The book Deep Learning by Ian Goodfellow mentions:
The function σ−1(x) is called the logit in statistics, but this term
is more rarely used in machine learning. σ−1(x) stands for the
inverse function of logistic sigmoid function.
In TensorFlow, it is frequently seen as the name of the last layer. In Chapter 10 of the book Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron, I came across this paragraph, which describes the logits layer clearly.
note that logits is the output of the neural network before going
through the softmax activation function: for optimization reasons, we
will handle the softmax computation later.
That is to say, although we use softmax as the activation function in the last layer in our design, for ease of computation, we take out logits separately. This is because it is more efficient to calculate softmax and cross-entropy loss together. Remember that cross-entropy is a cost function, not used in forward propagation.
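One reason for computing softmax and cross-entropy together can be sketched in pure Python: the fused form works on log-sum-exp directly and never exponentiates a large logit. The function names and inputs are illustrative, and this is a sketch of the general technique rather than TensorFlow's actual kernel.

```python
import math

def separate_softmax_then_ce(label_index, logits):
    # Two separate steps; math.exp overflows for large logits.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[label_index])

def fused_softmax_ce(label_index, logits):
    # loss = log(sum_k exp(z_k)) - z_label, with the max factored out.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label_index]

small = [2.0, 1.0, 0.1]
large = [1000.0, 0.0]

try:
    separate_softmax_then_ce(1, large)
    separate_failed = False
except OverflowError:
    separate_failed = True

fused_loss = fused_softmax_ce(1, large)
print(separate_failed, fused_loss)
```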
If you check the mathematical logit function, it maps the interval [0,1] to the whole real line [-inf, inf].
Sigmoid and softmax do exactly the opposite: they map the real line [-inf, inf] to [0, 1].
This is why, in machine learning, we may speak of logits before the sigmoid and softmax functions (since they match).
And this is why "we may call" anything in machine learning that goes in front of a sigmoid or softmax function the logit.
Here is a G. Hinton video using this term.
Here is a concise answer for future readers. TensorFlow's logit is defined as the output of a neuron without applying an activation function:
logit = w*x + b,
x: input, w: weight, b: bias. That's it.
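For a layer with several output units, the same formula is applied once per unit. A minimal pure-Python sketch, with made-up weights and inputs:

```python
def dense_logits(x, weights, biases):
    # One logit per output unit: logit_j = sum_i w_ji * x_i + b_j,
    # with no activation function applied afterwards.
    return [
        sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]

x = [1.0, 2.0]                                   # input
weights = [[0.5, -1.0], [2.0, 0.0], [0.0, 0.0]]  # one row per output unit
biases = [0.1, -3.0, 0.0]
logits = dense_logits(x, weights, biases)
print(logits)
```

The results can be negative or larger than 1, which is exactly what "un-activated" means here.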
The following is irrelevant to this question.
For historical lectures, read other answers. Hats off to Tensorflow's "creatively" confusing naming convention. In PyTorch, there is only one CrossEntropyLoss and it accepts un-activated outputs. Convolutions, matrix multiplications and activations are same level operations. The design is much more modular and less confusing. This is one of the reasons why I switched from Tensorflow to PyTorch.
logits
The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.
In addition, logits sometimes refer to the element-wise inverse of the sigmoid function. For more information, see tf.nn.sigmoid_cross_entropy_with_logits.
official tensorflow documentation
They are basically the fullest learned model you can get from the network, before it's been squashed down to apply to only the number of classes we are interested in. Check out how some researchers use them to train a shallow neural net based on what a deep network has learned: https://arxiv.org/pdf/1312.6184.pdf
It's kind of like how when learning a subject in detail, you will learn a great many minor points, but then when teaching a student, you will try to compress it to the simplest case. If the student now tried to teach, it'd be quite difficult, but would be able to describe it just well enough to use the language.
The logit (/ˈloʊdʒɪt/ LOH-jit) function is the inverse of the sigmoidal "logistic" function or logistic transform used in mathematics, especially in statistics. When the function's variable represents a probability p, the logit function gives the log-odds, or the logarithm of the odds p/(1 − p).
See here: https://en.wikipedia.org/wiki/Logit

softmax and sigmoid function for the output layer

In the deep learning implementations related to object detection and semantic segmentation, I have seen the output layers using either sigmoid or softmax. I am not very clear when to use which? It seems to me both of them can support these tasks. Are there any guidelines for this choice?
softmax() helps when you want a probability distribution, which sums up to 1. sigmoid is used when you want the output to be ranging from 0 to 1, but need not sum to 1.
In your case, you wish to classify and choose between two alternatives. I would recommend using softmax() as you will get a probability distribution which you can apply cross entropy loss function on.
The sigmoid and the softmax function have different purposes. For a detailed explanation of when to use sigmoid vs. softmax in neural network design, you can look at this article: "Classification: Sigmoid vs. Softmax."
Short summary:
If you have a multi-label classification problem where there is more than one "right answer" (the outputs are NOT mutually exclusive) then you can use a sigmoid function on each raw output independently. The sigmoid will allow you to have high probability for all of your classes, some of them, or none of them.
If you instead have a multi-class classification problem where there is only one "right answer" (the outputs are mutually exclusive), then use a softmax function. The softmax will enforce that the sum of the probabilities of your output classes are equal to one, so in order to increase the probability of a particular class, your model must correspondingly decrease the probability of at least one of the other classes.
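The difference between the two cases shows up directly when both functions are applied to the same raw outputs. A small pure-Python sketch with illustrative values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

raw_outputs = [3.0, 2.5, -4.0]

# Multi-label: each output is squashed independently; several can be high.
sigmoid_probs = [sigmoid(x) for x in raw_outputs]
# Multi-class: outputs compete, and the probabilities sum to one.
softmax_probs = softmax(raw_outputs)
print(sigmoid_probs, softmax_probs)
```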
Object detection is object classification used on a sliding window in the image. In classification it is important to find the correct output in some class space. E.g. you detect 10 different objects and you want to know which object is the most likely one in there. Then softmax is good because of its property that the whole layer sums up to 1.
Semantic segmentation on the other hand segments the image in some way. I have done semantic medical segmentation and there the output is a binary image. This means you can have sigmoid as output to predict if this pixel belongs to this specific class, because sigmoid values are between 0 and 1 for each output class.
In general, softmax is used (Softmax Classifier) when there are n classes. Either sigmoid or softmax can be used for binary (n=2) classification.
Sigmoid:
S(x) = 1/ ( 1+ ( e^(-x) ))
Softmax:
σ(z)_j = e^(z_j) / Σ_{k=1..K} e^(z_k), for j = 1, ..., K
Softmax is kind of Multi Class Sigmoid, but if you see the function of Softmax, the sum of all softmax units are supposed to be 1. In sigmoid it’s not really necessary.
Digging deeper, you can also use sigmoid for multi-class classification. When you use a softmax, you basically get a probability for each class (a joint distribution and a multinomial likelihood), and these probabilities are bound to sum to one. If you use sigmoid for multi-class classification, it is like a marginal distribution and a Bernoulli likelihood: p(y0|x), p(y1|x), etc.
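For the binary case mentioned above, the two choices are closely related: a two-class softmax gives the same probability as a sigmoid applied to the difference of the two raw outputs. A minimal pure-Python check with made-up values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

z0, z1 = 0.5, 2.0
p_softmax = softmax([z0, z1])[1]     # probability of class 1 from softmax
p_sigmoid = sigmoid(z1 - z0)         # same probability from a single sigmoid
print(p_softmax, p_sigmoid)
```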