Meaning of sparse in "sparse cross entropy loss"? - tensorflow

I read from the documentation:
tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False, reduction="auto", name="sparse_categorical_crossentropy"
)
Computes the crossentropy loss between the labels and predictions.
Use this crossentropy loss function when there are two or more label
classes. We expect labels to be provided as integers. If you want to
provide labels using one-hot representation, please use
CategoricalCrossentropy loss. There should be # classes floating point
values per feature for y_pred and a single floating point value per
feature for y_true.
Why is this called sparse categorical cross entropy? If anything, we are providing a more compact encoding of class labels (integers vs one-hot vectors).

I think this is because integer encoding is more compact than one-hot encoding and thus more suitable for encoding sparse binary data. In other words, integer encoding = better encoding for sparse binary data.
This can be handy when you have many possible labels (and samples), in which case a one-hot encoding can be significantly more wasteful than a simple integer per example.

Why exactly it is called that is probably best answered by the Keras devs. However, note that this sparse cross-entropy is only suitable for "sparse labels", where exactly one value is 1 and all others are 0 (if the labels were represented as a vector and not just an index).
On the other hand, the general CategoricalCrossentropy also works with targets that are not one-hot, i.e. any probability distribution. The values just need to be between 0 and 1 and sum to 1. This tends to be forgotten because the use case of one-hot targets is so common in current ML applications.
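To make the distinction concrete, here is a minimal pure-Python sketch (not the Keras implementation; the function names simply mirror the Keras ones, and the numbers are made up). It shows that an integer label and its one-hot form give the same loss, while a soft target can only be expressed in the general form.

```python
import math

def categorical_crossentropy(target_probs, predicted_probs):
    """General cross-entropy: the target may be any probability distribution."""
    return -sum(p * math.log(q) for p, q in zip(target_probs, predicted_probs) if p > 0)

def sparse_categorical_crossentropy(label, predicted_probs):
    """Same loss for a one-hot target, given only the index of the true class."""
    return -math.log(predicted_probs[label])

predicted = [0.1, 0.7, 0.2]  # illustrative model output (already a distribution)

# Integer label 1 and its one-hot form [0, 1, 0] give the same loss.
sparse_loss = sparse_categorical_crossentropy(1, predicted)
dense_loss = categorical_crossentropy([0.0, 1.0, 0.0], predicted)

# A soft target has no integer-label equivalent.
soft_loss = categorical_crossentropy([0.0, 0.6, 0.4], predicted)
```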

Related

Custom loss in Keras with softmax to one-hot

I have a model that outputs a Softmax, and I would like to develop a custom loss function. The desired behaviour would be:
1) Softmax to one-hot (normally I do numpy.argmax(softmax_vector) and set that index to 1 in a null vector, but this is not allowed in a loss function).
2) Multiply the resulting one-hot vector by my embedding matrix to get an embedding vector (in my context: the word-vector that is associated to a given word, where words have been tokenized and assigned to indices, or classes for the Softmax output).
3) Compare this vector with the target (this could be a normal Keras loss function).
I know how to write a custom loss function in general, but not how to do this. I found this closely related question (unanswered), but my case is a bit different, since I would like to preserve my softmax output.
It is possible to mix TensorFlow and Keras in your custom loss function. Once you can access all TensorFlow functions, things become very easy. Here is an example of how this function could be implemented.
import tensorflow as tf

def custom_loss(target, softmax):
    # Index of the most probable class for each sample.
    max_indices = tf.argmax(softmax, -1)
    # Get the embedding vectors. In TensorFlow, this can be done directly
    # with tf.nn.embedding_lookup. `you_embedding_matrix` is a placeholder
    # for your own embedding matrix.
    embedding_vectors = tf.nn.embedding_lookup(you_embedding_matrix, max_indices)
    # Do anything you want with a normal Keras loss function
    # (`some_keras_loss_function` is a placeholder).
    loss = some_keras_loss_function(target, embedding_vectors)
    loss = tf.reduce_mean(loss)
    return loss
Fan Luo's answer points in the right direction, but ultimately will not work because it involves non-differentiable operations. Such operations are acceptable on the target (a loss function takes a target value and a predicted value, and non-differentiable operations are only fine on the target).
To be fair, that was what I was asking for in the first place. It is not possible to do what I wanted, but we can get a similar, differentiable behaviour:
1) Element-wise power of the softmax values. This makes smaller values much smaller. For example, with a power of 4, [0.5, 0.2, 0.7] becomes [0.0625, 0.0016, 0.2401]. Note that 0.2 is comparable to 0.7, but 0.0016 is negligible with respect to 0.2401. The higher my_power is, the more similar to a one-hot vector the final result will be.
soft_extreme = Lambda(lambda x: x ** my_power)(softmax)
2) Importantly, both softmax and one-hot vectors are normalized, but not our "soft_extreme". First, find the sum of the array:
norm = tf.reduce_sum(soft_extreme, axis=1, keepdims=True)  # keepdims so the division broadcasts per sample
3) Normalize soft_extreme:
almost_one_hot = Lambda(lambda x: x / norm)(soft_extreme)
Note: Setting my_power too high in 1) will result in NaNs. If you need a better softmax to one-hot conversion, then you may do steps 1 to 3 two or more times in a row.
4) Finally, we want the vector from the dictionary. Lookup is forbidden, but we can take the average vector using matrix multiplication. Because our almost_one_hot is similar to a one-hot encoding, this average will be similar to the vector associated with the highest argument (the originally intended behaviour). The higher my_power is in (1), the truer this will be:
target_vectors = tf.tensordot(almost_one_hot, embedding_matrix, axes=[[1], [0]])
Note: This will not work directly using batches! In my case, I reshaped my "one hot" (from [batch, dictionary_length] to [batch, 1, dictionary_length]) using tf.reshape. Then I tiled my embedding_matrix batch times and finally used:
predicted_vectors = tf.matmul(reshaped_one_hot, tiled_embedding)
There may be more elegant solutions (or less memory-hungry, if tiling the embedding matrix is not an option), so feel free to explore more.
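Steps 1 to 4 can be sketched for a single sample in plain Python, as a stand-in for the TensorFlow ops above rather than a drop-in loss. The helper names `sharpen` and `soft_lookup`, the power of 8, and the tiny embedding matrix are all assumptions chosen for illustration.

```python
def sharpen(probabilities, power):
    """Raise each probability to `power`, then renormalise so the result sums to 1."""
    powered = [p ** power for p in probabilities]
    total = sum(powered)
    return [p / total for p in powered]

def soft_lookup(weights, embedding_matrix):
    """Weighted average of the embedding rows (a vector-matrix product)."""
    dim = len(embedding_matrix[0])
    return [sum(w * row[d] for w, row in zip(weights, embedding_matrix)) for d in range(dim)]

softmax_output = [0.3, 0.1, 0.6]
embedding_matrix = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # 3 words, 2 dimensions

almost_one_hot = sharpen(softmax_output, power=8)
approx_vector = soft_lookup(almost_one_hot, embedding_matrix)

# What a hard (non-differentiable) argmax lookup would have returned.
hard_vector = embedding_matrix[softmax_output.index(max(softmax_output))]
```

With these numbers the sharpened vector puts more than 99% of its mass on the last word, so approx_vector lands close to hard_vector while every step remains a smooth function of the softmax output.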

Counterpart to categorical crossentropy for not one-hot encoded labels

I'm building a neural network with Keras, where my labels are vectors in which exactly 6 values are 1, while all the other values (around 7000) are zero. I'm currently using categorical_crossentropy as my loss function, but the documentation says:
Note: when using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all-zeros except for a 1 at the index corresponding to the class of the sample).
So what would be the "right" error function if categorical_crossentropy is only the right way for one-hot encoded labels?
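You can use sparse_categorical_crossentropy as loss, which accepts integer class indices instead of one-hot encoded ones. Note, however, that it expects exactly one class index per sample, so it does not directly cover labels with six active classes; for such multi-label targets, a sigmoid output with binary_crossentropy is the usual choice.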

What is embedding_column doing in tensorflow

From the docs it seems to me that it uses an embedding matrix to transform a sparse, one-hot-like input vector into a dense vector. But how is this different from just using a fully connected layer?
Summarizing the answer from the comments here.
The main difference is efficiency. Instead of having to encode data points in these very long one hot vectors and do matrix multiplication, using embedding_column allows you to use index vectors and do a matrix lookup.
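The equivalence behind this can be checked with a small plain-Python sketch. The matrix values and helper names are made up for illustration, and this is not how embedding_column is implemented internally: multiplying a one-hot vector by the embedding matrix selects exactly one row, which is what an index lookup returns directly.

```python
embedding_matrix = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
]  # 4 categories, 2-dimensional embeddings

def one_hot(index, depth):
    """Vector of length `depth` with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

def vector_matrix_product(vector, matrix):
    """Dense product: touches every row of the matrix."""
    return [sum(v * row[d] for v, row in zip(vector, matrix)) for d in range(len(matrix[0]))]

category = 2
via_matmul = vector_matrix_product(one_hot(category, 4), embedding_matrix)
via_lookup = embedding_matrix[category]  # a single row access
```

The dense product does work proportional to the number of categories for every input, while the lookup reads one row regardless of how many categories exist.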
To represent categories.
Both one-hot encoding and embedding column are options to represent categorical features.
One of the problems with one-hot encoding is that it doesn't encode any relationships between the categories. They are completely independent of each other, so the neural network has no way of knowing which ones are similar to each other.
This problem can be solved by representing a categorical feature with an embedding column. The idea is that each category has a smaller vector. The values are weights, similar to the weights that are used for basic features in a neural network.
For more:
https://developers.googleblog.com/2017/11/introducing-tensorflow-feature-columns.html

What are the differences between all these cross-entropy losses in Keras and TensorFlow?

What are the differences between all these cross-entropy losses?
Keras talks about
Binary cross-entropy
Categorical cross-entropy
Sparse categorical cross-entropy
While TensorFlow has
Softmax cross-entropy with logits
Sparse softmax cross-entropy with logits
Sigmoid cross-entropy with logits
What are the differences and relationships between them? What are the typical applications for them? What's the mathematical background? Are there other cross-entropy types that one should know? Are there any cross-entropy types without logits?
There is just one cross (Shannon) entropy defined as:
H(P||Q) = - SUM_i P(X=i) log Q(X=i)
In machine learning usage, P is the actual (ground truth) distribution, and Q is the predicted distribution. All the functions you listed are just helper functions which accept different ways of representing P and Q.
There are basically 3 main things to consider:
1) There are either 2 possible outcomes (binary classification) or more. If there are just two outcomes, then Q(X=1) = 1 - Q(X=0), so a single float in (0,1) identifies the whole distribution. This is why a neural network for binary classification has a single output (and so does logistic regression). If there are K>2 possible outcomes, one has to define K outputs (one for each Q(X=...)).
2) One either produces proper probabilities (meaning that Q(X=i)>=0 and SUM_i Q(X=i) = 1), or one just produces a "score" and has some fixed method of transforming the score into a probability. For example, a single real number can be "transformed to a probability" by taking its sigmoid, and a set of real numbers can be transformed by taking their softmax, and so on.
3) Either there is a j such that P(X=j)=1 (there is one "true class", targets are "hard", like "this image represents a cat"), or there are "soft targets" (like "we are 60% sure this is a cat, but with 40% it is actually a dog").
Depending on these three aspects, a different helper function should be used:
                                outcomes   what is in Q   targets in P
-------------------------------------------------------------------------------
binary CE                       2          probability    any
categorical CE                  >2         probability    soft
sparse categorical CE           >2         probability    hard
sigmoid CE with logits          2          score          any
softmax CE with logits          >2         score          soft
sparse softmax CE with logits   >2         score          hard
In the end, one could just use "categorical cross entropy", as this is how it is mathematically defined. However, since things like hard targets or binary classification are very popular, modern ML libraries provide these additional helper functions to make things simpler. In particular, "stacking" sigmoid and cross entropy might be numerically unstable, but if one knows these two operations are applied together, there is a numerically stable version of them combined (which is implemented in TF).
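The stability point can be illustrated with a small sketch in plain Python. The stable expression max(x, 0) - x*z + log(1 + exp(-|x|)) is the one given in the TensorFlow documentation for sigmoid_cross_entropy_with_logits; the function names and test values here are assumptions for illustration.

```python
import math

def naive_sigmoid_crossentropy(logit, target):
    """Apply the sigmoid first, then the binary cross-entropy formula."""
    q = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(q) + (1.0 - target) * math.log(1.0 - q))

def stable_sigmoid_crossentropy(logit, target):
    """Algebraically equal form: max(x, 0) - x*z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# For moderate logits both forms agree.
moderate_naive = naive_sigmoid_crossentropy(2.0, 1.0)
moderate_stable = stable_sigmoid_crossentropy(2.0, 1.0)

# For a confident wrong prediction the sigmoid rounds to exactly 1.0,
# so the naive form ends up evaluating log(0).
try:
    extreme_naive = naive_sigmoid_crossentropy(40.0, 0.0)
except ValueError:
    extreme_naive = None
extreme_stable = stable_sigmoid_crossentropy(40.0, 0.0)
```

In Python the naive version raises an error at a logit of 40; in a tensor library the same situation typically surfaces as inf or NaN in the loss.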
It is important to notice that if you apply the wrong helper function, the code will usually still execute, but the results will be wrong. For example, if you apply a softmax_* helper for binary classification with one output, your network will be considered to always produce "True" at the output.
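The "always True" failure mode follows directly from the definition of softmax, as this minimal plain-Python sketch shows (the logit values are arbitrary): normalising a single score by itself always yields probability 1, whatever the score is.

```python
import math

def softmax(logits):
    """Softmax with the usual max-shift for numerical safety."""
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# A "binary classifier" with one output unit, passed through softmax.
single_output_probs = [softmax([logit])[0] for logit in (-5.0, 0.0, 3.2)]
```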
As a final note, this answer considers classification. It is slightly different when you consider the multi-label case (when a single point can have multiple labels), as then the Ps do not sum to 1, and one should use sigmoid_cross_entropy_with_logits despite having multiple output units.
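For the multi-label case, a rough plain-Python sketch of the idea (not the TensorFlow op itself; names and numbers are illustrative) treats every output unit as its own binary problem and sums the per-unit losses:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_crossentropy(targets, logits):
    """Sum of independent binary cross-entropies, one per output unit."""
    loss = 0.0
    for z, x in zip(targets, logits):
        q = sigmoid(x)
        loss -= z * math.log(q) + (1 - z) * math.log(1 - q)
    return loss

targets = [1, 0, 1, 0]           # two labels are active at once
logits = [3.0, -2.0, 1.5, -4.0]  # one score per label
probabilities = [sigmoid(x) for x in logits]
loss = multilabel_crossentropy(targets, logits)
```

Because each unit has its own sigmoid, the predicted probabilities are free to sum to more than 1, which a softmax output could never represent.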
Logits
For this purpose, "logits" can be seen as the non-activated outputs of the model.
Keras losses by default expect an "activated" output (you must apply "sigmoid" or "softmax" before the loss),
while TensorFlow's "with logits" functions take "logits", i.e. "non-activated" outputs (you should not apply "sigmoid" or "softmax" before the loss).
Losses "with logits" will apply the activation internally.
Some functions allow you to choose from_logits=True or from_logits=False, which tells the function whether to "apply" or "not apply" the activation.
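A small plain-Python sketch of what "applying the activation internally" means for a softmax loss with an integer label (the function names are hypothetical, not library calls): activating first and then taking the log gives the same value as working from the logits with a log-sum-exp.

```python
import math

def softmax(logits):
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def loss_from_probabilities(label, probabilities):
    """Expects an already activated output."""
    return -math.log(probabilities[label])

def loss_from_logits(label, logits):
    """Expects raw scores; the softmax is folded into a log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

logits = [2.0, -1.0, 0.5]
label = 0
activated_first = loss_from_probabilities(label, softmax(logits))
internally_activated = loss_from_logits(label, logits)
```

Feeding already activated outputs into a loss that expects logits (or the reverse) would apply the softmax twice or not at all, which is the kind of silent error described above.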
Sparse
Sparse functions use the target data (ground truth) as "integer labels": 0, 1, 2, 3, 4.....
Non-sparse functions use the target data as "one-hot labels": [1,0,0], [0,1,0], [0,0,1]
Binary crossentropy = Sigmoid crossentropy
Problem type:
single class (false/true); or
non-exclusive multiclass (many classes may be correct)
Model output shape: (batch, ..., >=1)
Activation: "sigmoid"
Categorical crossentropy = Softmax crossentropy
Problem type: exclusive classes (only one class may be correct)
Model output shape: (batch, ..., >=2)
Activation: "softmax"

tensorflow - softmax ignore negative labels (just like caffe) [duplicate]

In Caffe, there is an option with its SoftmaxWithLoss function to ignore all negative labels (-1) in computing probabilities, so that only 0 or positive label probabilities add up to 1.
Is there a similar feature with Tensorflow softmax loss?
I just came up with a workaround: I created a one-hot tensor from the label indices using tf.one_hot (with the depth set to the # of labels). tf.one_hot automatically zeros out all indices with -1 in the resulting one-hot tensor (of shape [batch, # of labels]).
This enables softmax loss (i.e. tf.nn.softmax_cross_entropy_with_logits) to "ignore" all -1 labels.
I am not quite sure that your workaround is actually working.
Caffe's ignore_label has to be understood semantically as "the label of a sample which has to be ignored", so its effect is that the gradient for that sample is not backpropagated, which is in no way guaranteed by the use of a one-hot vector.
On one hand, I expect any meaningful model to quickly learn to predict a zero value, or a small enough value, for that specific entry, because all samples will have a zero in that entry; so to say, the backpropagated information due to errors in that prediction will vanish relatively fast.
On the other hand, you need to be aware that, from a mathematical point of view, Caffe's ignore_label and what you are doing are totally different.
That said, I am new to TF and need the exact same feature as Caffe's ignore_label.
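One common way to approximate ignore-label behaviour is to drop (or mask out) the ignored samples before averaging the per-sample losses, so they contribute neither to the loss nor to its gradient. The following plain-Python sketch only shows the arithmetic; `masked_softmax_loss` is a hypothetical helper, not a TensorFlow or Caffe function, and a TensorFlow version would need the equivalent masking expressed with tensor ops.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]

def masked_softmax_loss(batch_logits, labels, ignore_label=-1):
    """Mean cross-entropy over the samples whose label is not `ignore_label`."""
    losses = [
        -log_softmax(logits)[label]
        for logits, label in zip(batch_logits, labels)
        if label != ignore_label
    ]
    return sum(losses) / len(losses) if losses else 0.0

batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [-1.0, 3.0, 0.0]]
labels = [0, -1, 1]  # the second sample is to be ignored

loss = masked_softmax_loss(batch_logits, labels)
# Same result as if the ignored sample had never been in the batch.
kept_only = masked_softmax_loss([batch_logits[0], batch_logits[2]], [0, 1])
```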