I am working on a semantic segmentation project with multiclass data that is highly imbalanced. I looked into handling this during training through model.fit parameters, namely class_weight or sample_weight.
I can pass a class_weight dictionary such as
{0: 1, 1: 10, 2: 15}
I also saw a method of using these weights inside the loss function.
But at what point do these weights get updated?
If class_weight is used, where does the penalty come in? I already have a kernel_regularizer on each layer, so if my classes are to be penalized according to the class weights, will that penalize the output of each layer, y = Wx + b, or only the final layer?
Similarly, if I use a weighted loss function, is the weighting applied only at the final layer before the loss is calculated, or at every layer with the final loss computed afterwards?
Any explanation on this would be very useful.
The class_weights you mention in your dictionary are there to account for your imbalanced data. They never change; they only increase the penalty for misclassified instances of minority classes (that way your network pays more attention to them, and the gradients treat one 'Class2' instance as if it were 15 times more important than one 'Class0' instance).
The kernel_regularizer you mention enters your loss function and penalizes large weight norms for the weight matrices throughout the network (if you use kernel_regularizer = tf.keras.regularizers.l1(0.01) in a Dense layer, it only affects that layer). So that is a different kind of weight that has nothing to do with classes, only with the weights inside your network. Your eventual loss will be something like loss = Cross_entropy + a * norm(Weight_matrix), and the network is given the additional task of keeping its weight norms low while minimizing the classification loss (cross-entropy).
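For illustration, here is a minimal sketch (a toy 3-class classifier with dummy data, not the asker's segmentation model) showing where each mechanism plugs in: the regularizer is attached to one layer's weight matrix, while class_weight rescales each sample's loss term by the weight of its label.

import numpy as np
import tensorflow as tf

# Dummy data: 256 samples, 100 features, labels 0/1/2 (assumptions for the sketch).
x_train = np.random.rand(256, 100).astype("float32")
y_train = np.random.randint(0, 3, size=(256,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l1(0.01)),  # penalizes this layer's weights
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Samples of class 2 contribute 15x as much to the loss as samples of class 0.
model.fit(x_train, y_train, epochs=1, batch_size=32,
          class_weight={0: 1, 1: 10, 2: 15}, verbose=0)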
I am using autoencoders to do anomaly detection. I have finished training my model, and now I want to calculate the reconstruction loss for each entry in the dataset, so that I can flag data points with high reconstruction loss as anomalies.
This is my current code to calculate the reconstruction loss.
But this is really slow. By my estimation, it would take 5 hours to go through the dataset, whereas training one epoch takes approximately 55 minutes.
I feel that converting to tensors is bottlenecking the code, but I can't find a better way to do it.
I've tried changing the batch size, but it does not make much of a difference. I have to use the convert_to_tensor part because K.eval throws an error if I do it directly.
for i in range(0, encoded_dataset.shape[0], batch_size):
    y_true = tf.convert_to_tensor(encoded_dataset[i:i+batch_size].values,
                                  np.float32)
    y_pred = tf.convert_to_tensor(ae1.predict(encoded_dataset[i:i+batch_size].values),
                                  np.float32)
    # Append the batch losses (numpy array) to the list
    reconstruction_loss_transaction.append(K.eval(loss_function(y_true, y_pred)))
I was able to train in 55 minutes per epoch, so I feel prediction alone should not take 5 hours. encoded_dataset is a variable that holds the entire dataset in main memory as a DataFrame.
I am using an Azure VM instance.
K.eval(loss_function(y_true, y_pred)) is there to find the loss for each row of the batch.
So y_true will be of size (batch_size, 2000), and so will y_pred.
K.eval(loss_function(y_true, y_pred)) gives me an output of shape (batch_size, 1), evaluating binary cross-entropy on each row of y_true and y_pred.
Moved from comments:
My suspicion is that ae1.predict and K.eval(loss_function) are behaving in unexpected ways. ae1.predict can be used to output the loss value as well as y_pred: when you create the model, specify the loss value as an additional output (a model can have a list of outputs), then call predict just once here to get both y_pred and the loss value in a single call.
But I want the loss for each row. Won't the loss returned by the predict method be the mean loss for the entire batch?
The answer depends on how the loss function is implemented. Both ways produce perfectly valid and identical results in TF under the hood: you can average the loss over the batch before taking the gradient, or take the gradient w.r.t. a vector of losses. The gradient operation in TF will perform the averaging of the losses for you if you use the latter approach (see the SO posts on taking per-sample gradients, which is actually hard to do).
If Keras implements the loss with a reduce_mean built in, you can just define your own loss. If you're using squared loss, replace 'mean_squared_error' with lambda y_true, y_pred: tf.square(y_pred - y_true). That produces squared error instead of MSE (no difference to the gradient), but look here for the variant that includes the mean.
In any case this produces a per-sample loss as long as you don't use tf.reduce_mean, which is purely optional inside the loss. Another option is to compute the loss separately from what you optimize and make it an additional output of the model; that is also perfectly valid.
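As a concrete sketch of the per-row loss idea (an alternative to looping with K.eval; assumes TF 2.x eager execution and reuses the ae1 and encoded_dataset names from the question):

import tensorflow as tf

# One predict() call over the full dataset, then a vectorized per-row
# binary cross-entropy; no Python loop and no K.eval.
x = encoded_dataset.values.astype("float32")        # shape (n_rows, 2000)
y_pred = ae1.predict(x, batch_size=256)             # single pass over the data

# Keras' binary_crossentropy averages over the last axis only, so this
# returns one loss value per row, shape (n_rows,).
per_row_loss = tf.keras.losses.binary_crossentropy(x, y_pred).numpy()

reconstruction_loss_transaction = per_row_loss      # anomaly score per entry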
The softmax cross-entropy with logits loss function is used to reduce the difference between the logits and labels provided to the function. Typically, the labels are fixed for supervised learning and the logits are adapted. But what happens when the labels come from a differentiable source, e.g., another network? Do both networks, i.e., the "logits network" and the "labels network" get trained by the subsequent optimizer, or does this loss function always treat the labels as fixed?
TLDR: Does tf.nn.softmax_cross_entropy_with_logits() also provide gradients for the labels (if they are differentiable), or are they always considered fixed?
Thanks!
You need to use tf.nn.softmax_cross_entropy_with_logits_v2 to get gradients with respect to the labels.
The gradient is calculated from loss provided to the optimizer, if the "labels" are coming from another trainable network, then yes, these will be modified, since they influence the loss. The correct way of using another networks outputs for your own is to define it as untrainable, or make a list of all variables you want to train and pass them to the optimizer explicitly.
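A small sketch of both options, written against the TF 1.x graph API used in the question (via tf.compat.v1 in TF 2.x); the two toy dense "networks" are assumptions for illustration:

import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()

x = tf1.placeholder(tf.float32, [None, 10])
logits = tf1.layers.dense(x, 5, name="logits_net")                       # "logits network"
soft_labels = tf.nn.softmax(tf1.layers.dense(x, 5, name="labels_net"))   # "labels network"

# _v2 backpropagates into both arguments, so both networks would be trained:
loss_both = tf1.nn.softmax_cross_entropy_with_logits_v2(
    labels=soft_labels, logits=logits)

# Wrap the labels in tf.stop_gradient to treat them as fixed targets:
loss_fixed = tf1.nn.softmax_cross_entropy_with_logits_v2(
    labels=tf.stop_gradient(soft_labels), logits=logits)

train_op = tf1.train.GradientDescentOptimizer(0.1).minimize(tf.reduce_mean(loss_fixed))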
In the following TensorFlow function, we must feed in the activations of the artificial neurons in the final layer. That I understand. But I don't understand why it is called logits. Isn't that a mathematical function?
loss_function = tf.nn.softmax_cross_entropy_with_logits(
logits = last_layer,
labels = target_output
)
Logits is an overloaded term which can mean many different things:
In Math, Logit is a function that maps probabilities ([0, 1]) to R ((-inf, inf))
A probability of 0.5 corresponds to a logit of 0. Negative logits correspond to probabilities less than 0.5, positive logits to probabilities greater than 0.5.
In ML, it can be
the vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.
Logits also sometimes refer to the element-wise inverse of the sigmoid function.
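A quick numeric check of the Math definition above (a hypothetical snippet, not part of the original answer):

import numpy as np

# logit maps (0, 1) to the real line, with 0.5 -> 0,
# p < 0.5 -> negative values, p > 0.5 -> positive values.
def logit(p):
    return np.log(p / (1 - p))

for p in (0.1, 0.5, 0.9):
    print(p, "->", logit(p))   # 0.1 -> -2.197..., 0.5 -> 0.0, 0.9 -> 2.197...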
Just adding this clarification so that anyone who scrolls down this far can at least get it right, since there are so many upvoted answers that get it wrong.
Diansheng's answer and JakeJ's answer get it right.
A newer answer posted by Shital Shah is even better and more complete.
Yes, logit is a mathematical function in statistics, but the logit used in the context of neural networks is different. The statistical logit doesn't even make sense here.
I couldn't find a formal definition anywhere, but logit basically means:
The raw predictions which come out of the last layer of the neural network.
1. This is the very tensor on which you apply the argmax function to get the predicted class.
2. This is the very tensor which you feed into the softmax function to get the probabilities for the predicted classes.
Also, from a tutorial on the official TensorFlow website:
Logits Layer
The final layer in our neural network is the logits layer, which will return the raw values for our predictions. We create a dense layer with 10 neurons (one for each target class 0–9), with linear activation (the default):
logits = tf.layers.dense(inputs=dropout, units=10)
If you are still confused, the situation is like this:
raw_predictions = neural_net(input_layer)
predicted_class_index_by_raw = argmax(raw_predictions)
probabilities = softmax(raw_predictions)
predicted_class_index_by_prob = argmax(probabilities)
where, predicted_class_index_by_raw and predicted_class_index_by_prob will be equal.
Another name for raw_predictions in the above code is logit.
As for the why logit... I have no idea. Sorry.
[Edit: See this answer for the historical motivations behind the term.]
Trivia
Although, if you want to, you can apply statistical logit to probabilities that come out of the softmax function.
If the probability of a certain class is p,
Then the log-odds of that class is L = logit(p).
Also, the probability of that class can be recovered as p = sigmoid(L), using the sigmoid function.
Not very useful to calculate log-odds though.
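Still, a short sketch of that trivia (hypothetical snippet, plain NumPy): take softmax probabilities, apply the statistical logit to get the log-odds, and recover the probabilities with the sigmoid; it also confirms that the argmax is unchanged by the softmax.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def logit(p):
    return np.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

raw_predictions = np.array([2.0, 1.0, 0.1])          # "logits" in the deep-learning sense
p = softmax(raw_predictions)                          # probabilities
L = logit(p)                                          # log-odds of each class
print(np.allclose(sigmoid(L), p))                     # True: p = sigmoid(logit(p))
print(np.argmax(raw_predictions) == np.argmax(p))     # True: same predicted class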
Summary
In the context of deep learning, the logits layer means the layer that feeds into softmax (or another such normalization). The output of the softmax is the probabilities for the classification task, and its input is the logits layer. The logits layer typically produces values from -infinity to +infinity, and the softmax layer transforms them into values from 0 to 1.
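A minimal Keras sketch of this summary (layer sizes and dummy input are assumptions): the final Dense layer has no activation, so its outputs are logits; the softmax, here folded into the loss via from_logits=True, turns them into probabilities.

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(10)          # logits layer: linear activation, outputs in (-inf, inf)
])
# from_logits=True tells the loss to apply the softmax itself.
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

x = np.random.rand(4, 20).astype("float32")
logits = model(x)                       # raw scores, any real value
probs = tf.nn.softmax(logits)           # each row now sums to 1
print(probs.numpy().sum(axis=1))        # [1. 1. 1. 1.]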
Historical Context
Where does this term come from? In the 1930s and 40s, several people were trying to adapt linear regression to the problem of predicting probabilities. However, linear regression produces output from -infinity to +infinity, while for probabilities our desired output is 0 to 1. One way to do this is to somehow map the probabilities in 0 to 1 onto -infinity to +infinity and then use linear regression as usual. One such mapping is the cumulative normal distribution, used by Chester Ittner Bliss in 1934; he called this the "probit" model, short for "probability unit". However, this function is computationally expensive and lacks some of the desirable properties for multi-class classification. In 1944 Joseph Berkson used the function log(p/(1-p)) to do this mapping and called it the logit, short for "logistic unit". The term logistic regression derives from this as well.
The Confusion
Unfortunately the term logits is abused in deep learning. From a purely mathematical perspective, logit is a function that performs the above mapping. In deep learning, people started calling the layer that feeds into the softmax the "logits layer". Then people started calling the output values of this layer "logits", creating confusion with logit the function.
TensorFlow Code
Unfortunately TensorFlow code further adds to the confusion with names like tf.nn.softmax_cross_entropy_with_logits. What does logits mean here? It just means that the input of the function is supposed to be the output of the last neuron layer as described above. The _with_logits suffix is redundant, confusing and pointless. Functions should be named without regard to such very specific contexts, because they are simply mathematical operations that can be performed on values derived from many other domains. In fact TensorFlow has another similar function, sparse_softmax_cross_entropy, where they fortunately forgot to add the _with_logits suffix, creating inconsistency and adding to the confusion. PyTorch, on the other hand, simply names its function without these kinds of suffixes.
Reference
The Logit/Probit lecture slides are one of the best resources for understanding logit. I have also updated the Wikipedia article with some of the above information.
Logit is a function that maps probabilities [0, 1] to [-inf, +inf].
Softmax is a function that maps [-inf, +inf] to [0, 1], similar to the sigmoid. But Softmax also normalizes the sum of the values (the output vector) to be 1.
TensorFlow "with logits": it means that you are applying a softmax function to logit numbers to normalize them. The input_vector/logit is not normalized and can range over [-inf, inf].
This normalization is used for multiclass classification problems. For multilabel classification problems, sigmoid normalization is used instead, i.e. tf.nn.sigmoid_cross_entropy_with_logits.
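A short sketch contrasting the two ops (the logits and labels are made-up numbers): softmax normalization for mutually exclusive classes, a sigmoid per logit for independent labels.

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])      # same raw logits fed to both ops

# Multi-class (exactly one true class, one-hot labels): softmax normalizes across classes.
onehot = tf.constant([[1.0, 0.0, 0.0]])
multiclass_loss = tf.nn.softmax_cross_entropy_with_logits(labels=onehot, logits=logits)

# Multi-label (each class independently on/off): a sigmoid per logit.
multilabel = tf.constant([[1.0, 0.0, 1.0]])
multilabel_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=multilabel, logits=logits)

print(multiclass_loss.numpy())   # one value per example
print(multilabel_loss.numpy())   # one value per example per class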
My personal understanding: in the TensorFlow domain, logits are the values to be used as input to softmax. I came to this understanding based on this TensorFlow tutorial:
https://www.tensorflow.org/tutorials/layers
Although it is true that logit is a function in maths (especially in statistics), I don't think that's the same 'logit' you are looking at. In the book Deep Learning, Ian Goodfellow mentions:
The function σ⁻¹(x) is called the logit in statistics, but this term is more rarely used in machine learning. σ⁻¹(x) stands for the inverse function of the logistic sigmoid function.
In TensorFlow, it is frequently seen as the name of the last layer. In Chapter 10 of the book Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron, I came across this paragraph, which states the logits layer clearly:
note that logits is the output of the neural network before going through the softmax activation function: for optimization reasons, we will handle the softmax computation later.
That is to say, although we use softmax as the activation function of the last layer in our design, for ease of computation we take the logits out separately. This is because it is more efficient to calculate the softmax and the cross-entropy loss together. Remember that cross-entropy is a cost function and is not used in forward propagation.
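A small sketch of why the softmax is "handled later" (made-up extreme logits, TF 2.x eager): the fused op computes the softmax and cross-entropy together from the logits and stays numerically stable, while the naive softmax-then-log version blows up.

import tensorflow as tf

logits = tf.constant([[1000.0, -1000.0]])
labels = tf.constant([[0.0, 1.0]])

# Naive two-step version: softmax underflows to exactly 0 for the labeled class,
# so log(0) = -inf and the loss becomes infinite.
naive = -tf.reduce_sum(labels * tf.math.log(tf.nn.softmax(logits)), axis=-1)

# Fused version: softmax + cross-entropy computed together, directly from the logits.
fused = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)

print(naive.numpy())   # [inf]
print(fused.numpy())   # [2000.]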
If you check the mathematical logit function, it converts real space from the [0, 1] interval to infinity, [-inf, inf].
Sigmoid and softmax do exactly the opposite: they convert the [-inf, inf] real space to [0, 1] real space.
This is why, in machine learning, we may use logit before the sigmoid and softmax functions (since they match up).
And this is why "we may call" anything in machine learning that goes in front of the sigmoid or softmax function a logit.
Here is a G. Hinton video using this term.
Here is a concise answer for future readers. TensorFlow's logit is defined as the output of a neuron without applying an activation function:
logit = w*x + b,
x: input, w: weight, b: bias. That's it.
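A tiny sketch of that definition (the shapes are made up):

import numpy as np

x = np.array([0.5, -1.0, 2.0])     # input
w = np.random.randn(3, 4)          # weights: 3 inputs -> 4 output units
b = np.zeros(4)                    # biases
logit = x @ w + b                  # raw outputs, no activation applied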
The following is irrelevant to this question.
For the historical background, read the other answers. Hats off to TensorFlow's "creatively" confusing naming convention. In PyTorch, there is only one CrossEntropyLoss, and it accepts un-activated outputs. Convolutions, matrix multiplications and activations are same-level operations. The design is much more modular and less confusing. This is one of the reasons why I switched from TensorFlow to PyTorch.
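A minimal sketch of that PyTorch behaviour (dummy tensors, an assumed batch of 8 with 10 classes): CrossEntropyLoss takes the raw, un-activated outputs plus integer class indices, with no softmax layer in the model.

import torch
import torch.nn as nn

logits = torch.randn(8, 10)              # raw outputs, no activation applied
targets = torch.randint(0, 10, (8,))     # integer class indices
loss = nn.CrossEntropyLoss()(logits, targets)
print(loss.item())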
logits
The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.
In addition, logits sometimes refer to the element-wise inverse of the sigmoid function. For more information, see tf.nn.sigmoid_cross_entropy_with_logits.
official tensorflow documentation
They are basically the fullest learned model you can get from the network, before it's been squashed down to apply to only the number of classes we are interested in. Check out how some researchers use them to train a shallow neural net based on what a deep network has learned: https://arxiv.org/pdf/1312.6184.pdf
It's kind of like how, when learning a subject in detail, you learn a great many minor points, but when teaching a student you try to compress it to the simplest case. If the student then tried to teach in turn, it would be quite difficult, but they would be able to describe the subject just well enough to use the language.
The logit (/ˈloʊdʒɪt/ LOH-jit) function is the inverse of the sigmoidal "logistic" function or logistic transform used in mathematics, especially in statistics. When the function's variable represents a probability p, the logit function gives the log-odds, or the logarithm of the odds p/(1 − p).
See here: https://en.wikipedia.org/wiki/Logit
How can I implement max norm constraints on the weights in an MLP in tensorflow? The kind that Hinton and Dean describe in their work on dark knowledge. That is, does tf.nn.dropout implement the weight constraints by default, or do we need to do it explicitly, as in
https://arxiv.org/pdf/1207.0580.pdf
"If these networks share the same weights for the hidden units that are present.
We use the standard, stochastic gradient descent procedure for training the dropout neural
networks on mini-batches of training cases, but we modify the penalty term that is normally
used to prevent the weights from growing too large. Instead of penalizing the squared length
(L2 norm) of the whole weight vector, we set an upper bound on the L2 norm of the incoming
weight vector for each individual hidden unit. If a weight-update violates this constraint, we
renormalize the weights of the hidden unit by division."
Keras appears to have it
http://keras.io/constraints/
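For reference, a minimal tf.keras sketch of that per-unit constraint (hypothetical layer size; the constraints API is the modern spelling of the keras.io link above): cap the L2 norm of each hidden unit's incoming weight vector and renormalize whenever an update exceeds the bound.

from tensorflow import keras
from tensorflow.keras.constraints import MaxNorm

# Constrain the incoming weight vector of every unit (axis=0 of the kernel)
# to an L2 norm of at most 3.0; weights are rescaled when an update
# pushes them past the bound.
layer = keras.layers.Dense(128, activation="relu",
                           kernel_constraint=MaxNorm(max_value=3.0, axis=0))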
tf.nn.dropout does not impose any norm constraint. I believe what you're looking for is to "process the gradients before applying them" using tf.clip_by_norm.
For example, instead of simply:
# Create an optimizer; minimize() implicitly calls compute_gradients() and apply_gradients()
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
You could:
# Create an optimizer.
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# Compute the gradients for a list of variables.
grads_and_vars = optimizer.compute_gradients(loss, [weights1, weights2, ...])
# grads_and_vars is a list of tuples (gradient, variable).
# Do whatever you need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(tf.clip_by_norm(gv[0], clip_norm=123.0, axes=0), gv[1])
for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients
train_op = optimizer.apply_gradients(capped_grads_and_vars)
I hope this helps. Final notes about tf.clip_by_norm's axes parameter:
If you're calculating tf.nn.xw_plus_b(x, weights, biases), or equivalently matmul(x, weights) + biases, where the dimensions of x and weights are (batch, in_units) and (in_units, out_units) respectively, then you probably want to set axes == [0] (because in this usage each column holds all the incoming weights to a specific unit).
Pay attention to the shape/dimensions of your variables above and to whether/how exactly you want to clip_by_norm each of them! For example, if some of [weights1, weights2, ...] are matrices and some aren't, and you call clip_by_norm() on the grads_and_vars with the same axes value as in the list comprehension above, it doesn't mean the same thing for all the variables. If you're lucky, this will result in a weird error like ValueError: Invalid reduction dimension 1 for input with 1 dimensions, but otherwise it's a very sneaky bug.
You can use tf.clip_by_value:
https://www.tensorflow.org/versions/r0.10/api_docs/python/train/gradient_clipping
Gradient clipping is also used to prevent weight explosion in recurrent neural networks.
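A short sketch of that element-wise variant (TF 1.x graph style matching the snippet above, with a stand-in loss): clip each gradient element to [-1, 1] before applying it.

import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()

w = tf1.get_variable("w", shape=[10, 10])
loss = tf.reduce_sum(tf.square(w))                  # stand-in loss for illustration

optimizer = tf1.train.GradientDescentOptimizer(0.01)
grads_and_vars = optimizer.compute_gradients(loss, [w])
clipped = [(tf.clip_by_value(g, -1.0, 1.0), v) for g, v in grads_and_vars]
train_op = optimizer.apply_gradients(clipped)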