what's the difference between softmax_cross_entropy_with_logits and losses.log_loss? - tensorflow

whats the primary difference between tf.nn.softmax_cross_entropy_with_logits and tf.losses.log_loss? both methods accept 1-hot labels and logits to calculate cross entropy loss for classification tasks.

Those methods are not so different in theory, however have number of differences in implementation:
1) tf.nn.softmax_cross_entropy_with_logitsis designed for single-class labels, while tf.losses.log_losscan be used for multi-class classification. tf.nn.softmax_cross_entropy_with_logits won't throw an error if you feed multi-class labels, however your gradients won't be calculated correctly and training most probably will fail.
From official documentation:
NOTE: While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. If they are not, the computation of the gradient will be incorrect.
2) tf.nn.softmax_cross_entropy_with_logits calculates (as it's seen from the name) soft-max function on top of your predictions first, while log_loss doesn't do this.
3) tf.losses.log_loss has a little wider functionality in a sense that you can weight each element of the loss function or you can specify epsilon, which is used in calculations, to avoid log(0) value.
4) Finally, tf.nn.softmax_cross_entropy_with_logits returns loss for every entry in the batch, while tf.losses.log_loss returns reduced (sum over all samples by default) value which can be directly used in optimizer.
UPD: Another difference is the way the calculate the loss, Logarithmic loss takes into account negative classes (those where you have 0s in the vector). Shortly, cross-enthropy loss forces network to produce maximum input for the correct class and does not care about negative classes. Logarithmic loss does both at the same time, it forces correct classes to have larger values and negative lesser. In mathematic expression it looks as following:
Cross-enthropy loss:
Logarithmic Loss:
Where i is the corresponding class.
So for example, if you have labels=[1,0] and predictions_with_softmax = [0.7,0.3], then:
1) Cross-Enthropy Loss: -(1 * log(0.7) + 0 * log(0.3)) = 0.3567
2) Logarithmic Loss: - (1*log(0.7) + (1-1) * log(1 - 0.7) +0*log(0.3) + (1-0) log (1- 0.3)) = - (log(0.7) + log (0.7)) = 0.7133
And then if you use default value for tf.losses.log_loss you then need to divide the log_loss output by the number of non-zero elements (here it's 2). So finally: tf.nn.log_loss = 0.7133 / 2 = 0.3566
In this case we got equal outputs, however it is not always the case

There are basically two differences between,
1) Labels used in tf.nn.softmax_cross_entropy_with_logits are the one hot version of labels used in tf.losses.log_loss.
2) tf.nn.softmax_cross_entropy_with_logits calcultes the softmax of logits internally before the calculation of the cross-entrophy.
Notice that tf.losses.log_loss also accepts one-hot encoded labels. However, tf.nn.softmax_cross_entropy_with_logits only accepts the labels with one-hot encoding.
Hope this helps.

Related

Custom Keras loss to minimize count of elements above a given threshold

I am trying to create a custom loss function for a regression problem that would minimize the number of elements that falls above a certain threshold. my code for this is:
import tensorflow as tf
epsilon = 0.000001
def custom_loss(actual, predicted): # loss
actual = actual * 12
predicted = predicted * 12
# outputs a value between 1 and 20
vector = tf.sqrt(2 * (tf.square(predicted - actual + epsilon)) / (predicted + actual + epsilon))
# Count number of elements above threshold value of 5
fail_count = tf.cast(tf.size(vector[vector>5]), tf.float32)
return fail_count
I however, run into the following error:
ValueError: No gradients provided for any variable: ...
How do I solve this problem?
I don't think you can use this loss function, because the loss does not vary smoothly as the model parameters vary - it will jump from one value to another different value as parameters pass a theshold point. So tensorflow can't calculate gradients, and so can't train the model.
It's the same reason that 'number of images incorrectly classified' isn't used as a loss function, and categorical cross-entropy, which does vary smoothly as parameters change, is used instead.
You may need to find a smoothly varying function that approximates what you want.
[Added after your response below...]
This might do it. It becomes closer to your function as temperature is reduced. But it may not have good training dynamics, and there could be better solutions out there. One approach might be to start training with relatively large temperature, and reduce it as training progresses.
temperature = 1.0
fail_count=tf.reduce_sum(tf.math.sigmoid((vector-5.)/temperature))

Binary classification of pairs with opposite labels

I have a data-set without labels, but I do have a way to get pairs of examples with opposite labels, that is given a pair x,z I know that their true labels are either 0,1 or 1,0.
So, I am building a model that accepts pairs of samples as input, and learns to classify them with opposite labels. Assuming I have an arbitrary model for predicting a single sample, y_hat = f(x), I am building a model with Keras that accepts pairs of samples (x,z) and outputs pairs of predictions, f(x), f(z). I then use a custom loss function that drives the model towards the correct direction: Given that a regular binary classifier is trained using the Binary Cross Entropy (BCE) to make the predicted and desired output "close", I use the negative BCE. Also, since BCE is not symmetric, I symmetrize it. So, the loss function I give the model.compile method is:
from tensorflow import keras
bce = keras.losses.BinaryCrossentropy()
def neg_sym_bce(y1, y2):
return (- 0.5 * (bce(y1, y2) + bce(y2, y1)))
My problem is, this model fails to learn to classify even a single pair of my data (I get f(x)~=f(z)~=0.5), and if I try to train it with synthetic "easy" data, it takes hundreds of epochs to converge (also on a single pair).
This made me suspect that it has to do with a "vanishing gradient" problem. Indeed, when I plot (see below) the loss for a single pair, which is a function of 2 variables (the 2 outputs), it is evident that there is a wide plateau around the 0.5, 0.5 point. It is also evident that the global minima is, as expected, around the points 0,1 and 1,0.
So, is there a way to deal with the vanishing gradient here? I read about the problem but the references I found deal with vanishing gradient in the network, not in the loss itself.
Or, is there another loss that can drive the model to predict opposite labels?
Think if your labels are always either 0,1 or 1,1 just use categorical_crossentropy for the loss.

Custom loss in Keras with softmax to one-hot

I have a model that outputs a Softmax, and I would like to develop a custom loss function. The desired behaviour would be:
1) Softmax to one-hot (normally I do numpy.argmax(softmax_vector) and set that index to 1 in a null vector, but this is not allowed in a loss function).
2) Multiply the resulting one-hot vector by my embedding matrix to get an embedding vector (in my context: the word-vector that is associated to a given word, where words have been tokenized and assigned to indices, or classes for the Softmax output).
3) Compare this vector with the target (this could be a normal Keras loss function).
I know how to write a custom loss function in general, but not to do this. I found this closely related question (unanswered), but my case is a bit different, since I would like to preserve my softmax output.
It is possible to mix tensorflow and keras in you customer loss function. Once you can access to all Tensorflow function, things become very easy. I just give you a example of how this function could be imlement.
import tensorflow as tf
def custom_loss(target, softmax):
max_indices = tf.argmax(softmax, -1)
# Get the embedding matrix. In Tensorflow, this can be directly done
# with tf.nn.embedding_lookup
embedding_vectors = tf.nn.embedding_lookup(you_embedding_matrix, max_indices)
# Do anything you want with normal keras loss function
loss = some_keras_loss_function(target, embedding_vectors)
loss = tf.reduce_mean(loss)
return loss
Fan Luo's answer points in the right direction, but ultimately will not work because it involves non-derivable operations. Note such operations are acceptable for the real value (a loss function takes a real value and a predicted value, non-derivable operations are only fine for the real value).
To be fair, that was what I was asking in the first place. It is not possible to do what I wanted, but we can get a similar and derivable behaviour:
1) Element-wise power of the softmax values. This makes smaller values much smaller. For example, with a power of 4 [0.5, 0.2, 0.7] becomes [0.0625, 0.0016, 0.2400]. Note that 0.2 is comparable to 0.7, but 0.0016 is negligible with respect to 0.24. The higher my_power is, the more similar to a one-hot the final result will be.
soft_extreme = Lambda(lambda x: x ** my_power)(softmax)
2) Importantly, both softmax and one-hot vectors are normalized, but not our "soft_extreme". First, find the sum of the array:
norm = tf.reduce_sum(soft_extreme, 1)
3) Normalize soft_extreme:
almost_one_hot = Lambda(lambda x: x / norm)(soft_extreme)
Note: Setting my_power too high in 1) will result in NaNs. If you need a better softmax to one-hot conversion, then you may do steps 1 to 3 two or more times in a row.
4) Finally we want the vector from the dictionary. Lookup is forbidden, but we can take the average vector using matrix multiplication. Because our soft_normalized is similar to one-hot encoding this average will be similar to the vector associated to the highest argument (original intended behaviour). The higher my_power is in (1), the truer this will be:
target_vectors = tf.tensordot(almost_one_hot, embedding_matrix, axes=[[1], [0]])
Note: This will not work directly using batches! In my case, I reshaped my "one hot" (from [batch, dictionary_length] to [batch, 1, dictionary_length] using tf.reshape. Then tiled my embedding_matrix batch times and finally used:
predicted_vectors = tf.matmul(reshaped_one_hot, tiled_embedding)
There may be more elegant solutions (or less memory-hungry, if tiling the embedding matrix is not an option), so feel free to explore more.

Tensorflow num_classes parameter of nce_loss()

My understanding of noise contrastive estimation is that we sample some vectors from our word embeddings (the negative sample), and then calculate the log-likelihood of each. Then we want to maximize the difference between the probability of the target word and the log-likelihood of each of the negative sample words (So if I'm correct about this, we want to optimize the loss function so that it gets as close to 1 as possible).
My question is this:
What is the purpose of the num_classes parameters to the nce_loss function? My best guess is that the number of classes is passed in so that Tensorflow knows the size of the distribution from which the negative samples our drawn, but this might not make sense, since we could just infer the size of the distribution from the variable itself. Otherwise, I can't think of a reason for why we would need to know the total possible number of classes, especially if the language model is only outputting k + 1 predictions (negative sample size + 1 for the target word).
Your guess is correct. The num_classes argument is used to sample negative labels from the log-uniform (Zipfian) distribution.
Here's the link to the source code:
# Sample the negative labels.
# sampled shape: [num_sampled] tensor
# true_expected_count shape = [batch_size, 1] tensor
# sampled_expected_count shape = [num_sampled] tensor
if sampled_values is None:
sampled_values = candidate_sampling_ops.log_uniform_candidate_sampler(
true_classes=labels,
num_true=num_true,
num_sampled=num_sampled,
unique=True,
range_max=num_classes)
The range_max=num_classes argument basically defines the shape of this distribution and also the range of the sampled values - [0, range_max). Note that this range can't be accurately inferred from the labels, because a particular mini-batch can have only small word ids, which would skew the distribution significantly.

TensorFlow: Are my logits in the right format for cross entropy function?

Alright, so I'm getting ready to run the tf.nn.softmax_cross_entropy_with_logits() function in Tensorflow.
It's my understanding that the 'logits' should be a Tensor of probabilities, each one corresponding to a certain pixel's probability that it is part of an image that will ultimately be a "dog" or a "truck" or whatever... a finite number of things.
These logits will get plugged into this cross entropy equation:
As I understand it, the logits are plugged into the right side of the equation. That is, they are the q of every x (image). If they were probabilities from 0 to 1... that would make sense to me. But when I'm running my code and ending up with a tensor of logits, I'm not getting probabilities. Instead I'm getting floats that are both positive and negative:
-0.07264724 -0.15262917 0.06612295 ..., -0.03235611 0.08587133 0.01897052 0.04655019 -0.20552202 0.08725972 ..., -0.02107313 -0.00567073 0.03241089 0.06872301 -0.20756687 0.01094618 ..., etc
So my question is... is that right? Do I have to somehow calculate all my logits and turn them into probabilities from 0 to 1?
The crucial thing to note is that tf.nn.softmax_cross_entropy_with_logits(logits, labels) performs an internal softmax on each row of logits so that they are interpretable as probabilities before they are fed to the cross entropy equation.
Therefore, the "logits" need not be probabilities (or even true log probabilities, as the name would suggest), because of the internal normalization that happens within that op.
An alternative way to write:
xent = tf.nn.softmax_cross_entropy_with_logits(logits, labels)
...would be:
softmax = tf.nn.softmax(logits)
xent = -tf.reduce_sum(labels * tf.log(softmax), 1)
However, this alternative would be (i) less numerically stable (since the softmax may compute much larger values) and (ii) less efficient (since some redundant computation would happen in the backprop). For real uses, we recommend that you use tf.nn.softmax_cross_entropy_with_logits().