TensorFlow: right loss function for multi-class and multi-label classification with ranking

I have 10 classes, named 0 to 9.
The output would look something like this:
[0.0, 0.75, 0.0, 1.0, 0.0, 0.875, 0.0, 0.0, 0.0, 0.0]
The above label is constructed so that index 3 (which is also class 3) is rank one, class 5 is rank two, class 1 is rank three, and the remaining classes are not relevant, so they all get rank zero. In other words, the largest value has the highest rank, and so on. My main focus is on the ranks themselves; I don't care about the specific values, such as 0.75, that correspond to each rank.
Approach 1 - Regression
Last Dense layer with 10 neurons, a linear activation function, and keras.losses.MeanSquaredError() as the loss.
Approach 2 - Multiclass classification
Last Dense layer with 10 neurons, a softmax activation function, and keras.losses.CategoricalCrossentropy() as the loss. With this approach, we can sort the normalized predictions, apply a threshold, and assign rank zero to everything below it.
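A minimal sketch of how these two setups might look in Keras (only the output layers and losses come from the approaches above; the input shape and hidden layer are made-up placeholders):

from tensorflow import keras

# Approach 1: regression over the 10 rank scores with MSE.
reg_model = keras.Sequential([
    keras.Input(shape=(20,)),                     # placeholder input size
    keras.layers.Dense(64, activation="relu"),    # placeholder hidden layer
    keras.layers.Dense(10, activation="linear"),
])
reg_model.compile(optimizer="adam", loss=keras.losses.MeanSquaredError())

# Approach 2: multiclass classification with softmax and categorical cross-entropy.
clf_model = keras.Sequential([
    keras.Input(shape=(20,)),                     # placeholder input size
    keras.layers.Dense(64, activation="relu"),    # placeholder hidden layer
    keras.layers.Dense(10, activation="softmax"),
])
clf_model.compile(optimizer="adam", loss=keras.losses.CategoricalCrossentropy())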
With Approach 1 my model mostly predicts zero, since that is the majority rank, and with Approach 2 I only get rank one right while the other ranks are squashed down.
Are there any other approaches to look into for this problem?

When you use softmax, you force the model to select one label and penalize the others; the assumption is that exactly one of the labels is correct.
In your case you have a multi-label, multi-class classification, so you can remove the softmax, apply a cross-entropy to each label independently, and take the loss as the sum of the per-class losses. If you have labels that should be ignored, you can handle that there as well.
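As a sketch of this suggestion (the input shape and hidden layer are placeholders I made up), this amounts to a sigmoid output with a binary cross-entropy applied to each of the 10 labels:

from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(20,)),                     # placeholder input size
    keras.layers.Dense(64, activation="relu"),    # placeholder hidden layer
    # Sigmoid instead of softmax: each class gets an independent score in [0, 1].
    keras.layers.Dense(10, activation="sigmoid"),
])

# Binary cross-entropy is computed per label and then averaged, which matches
# the per-class cross-entropy sum described above up to a constant factor.
model.compile(optimizer="adam", loss=keras.losses.BinaryCrossentropy())

The soft targets from the question (0.75, 0.875, 1.0, ...) can be fed to BinaryCrossentropy directly, since it accepts labels anywhere in [0, 1].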

Related

Regression accuracy with neural network in low density regions

I am developing a neural net which needs to predict values between -1 and 1. However, I am only really concerned about the values at the ends of scale, say between -1 and -0.7 and between 0.7 and 1.
I do not mind if 0.6, for example, gets predicted to be 0.1. However, I do want to know if it's 0.8 or 0.9.
The distribution of my data is roughly normal, so there are many more samples in the range where I'm not concerned about the accuracy. It seems therefore that the training process is likely to lead to greater accuracy in the centre.
How can I configure the training or engineer my expected result to overcome this?
Thanks very much.
You could assign the observations to deciles, turn it into a classification problem, and either assign a greater weight to the ranges you care about in the loss or simply oversample them during training. By default, I'd go with weighting the classes in the loss function, as it is straightforward to match with a weighted metric. Oversampling can be useful if you know that the distribution of your training data is different from the real data distribution.
To assign certain classes a greater weight in the loss function with Keras, you can pass a class_weight parameter to Model.fit. If label 0 is the first decile and label 9 is the last decile, you could double the weight of the first two and last two deciles as follows:
class_weight = {
    0: 2,
    1: 2,
    2: 1,
    3: 1,
    4: 1,
    5: 1,
    6: 1,
    7: 1,
    8: 2,
    9: 2,
}
model.fit(..., class_weight=class_weight)
To oversample certain classes, you'd include them more often in the batches than the class distribution would suggest. The simplest way to implement this is to sample observation indices with numpy.random.choice that has an optional parameter to specify probabilities for each entry. (Note that Keras Model.fit also has a sample_weight parameter where you can assign weights to each observation in the training data that will be applied when computing the loss function, but the intended use case is to weigh samples by the confidence in their labels, so I don't think it's applicable here.)
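A rough sketch of that oversampling idea with numpy.random.choice (the data here is a made-up placeholder):

import numpy as np

# Placeholder data: 1000 observations with decile labels 0-9.
x_train = np.random.normal(size=(1000, 5))
y_deciles = np.random.randint(0, 10, size=1000)

# Give the outer deciles (0, 1, 8, 9) twice the sampling probability.
weights = np.where(np.isin(y_deciles, [0, 1, 8, 9]), 2.0, 1.0)
probs = weights / weights.sum()

# Draw one epoch's worth of indices with replacement, biased toward the tails.
idx = np.random.choice(len(y_deciles), size=len(y_deciles), replace=True, p=probs)
x_oversampled, y_oversampled = x_train[idx], y_deciles[idx]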

Why not use the max value of the output tensor instead of the softmax function?

I built a CNN model for image classification, where each image belongs to one class.
The output tensor is a list with 65 elements. I feed this tensor into the softmax function and get the classification result.
I think the max value in this output tensor already identifies the predicted class, so why not classify this way? Is it just that the softmax function can be differentiated easily?
Softmax is used for multi-class classification. In multi-class classification the model is expected to classify the input into a single class with high probability, and predicting one class with high probability forces the probabilities of the other classes to be low.
As you stated, one reason to use softmax over a plain max function is that the softmax function is differentiable over the real numbers while the max function is not.
Edit:
There are some other properties of the softmax function that make it more suitable for neural networks than max. Firstly, it is a soft version of the max function. Let's say the logits of the neural network are [0.5, 0.5, 0.69, 0.7]. A hard max returns 1 for the maximum index (in this case the 4th index) and 0 for the other indexes, which loses the information about how close the competing logits were.
The second important property of softmax is that its outputs lie in the interval [0, 1] and sum to 1. For this reason the output of the softmax function can be interpreted as a probability distribution, i.e. as the model's confidence in assigning the input to each of the output classes.
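A small sketch of that difference, using the example logits above:

import tensorflow as tf

logits = tf.constant([0.5, 0.5, 0.69, 0.7])

# Softmax keeps information about how close the competing logits are.
print(tf.nn.softmax(logits).numpy())  # approximately [0.226, 0.226, 0.273, 0.276]

# Hard max throws that information away: a one-hot vector at the argmax.
print(tf.one_hot(tf.argmax(logits), depth=4).numpy())  # [0., 0., 0., 1.]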

Custom loss in Keras with softmax to one-hot

I have a model that outputs a Softmax, and I would like to develop a custom loss function. The desired behaviour would be:
1) Softmax to one-hot (normally I do numpy.argmax(softmax_vector) and set that index to 1 in a null vector, but this is not allowed in a loss function).
2) Multiply the resulting one-hot vector by my embedding matrix to get an embedding vector (in my context: the word-vector that is associated to a given word, where words have been tokenized and assigned to indices, or classes for the Softmax output).
3) Compare this vector with the target (this could be a normal Keras loss function).
I know how to write a custom loss function in general, but not how to do this. I found this closely related question (unanswered), but my case is a bit different, since I would like to preserve my softmax output.
It is possible to mix TensorFlow and Keras in your custom loss function. Once you can access all the TensorFlow functions, things become very easy. Here is an example of how this function could be implemented.
import tensorflow as tf

def custom_loss(target, softmax):
    max_indices = tf.argmax(softmax, -1)
    # Get the embedding vectors. In TensorFlow, this can be directly done
    # with tf.nn.embedding_lookup.
    embedding_vectors = tf.nn.embedding_lookup(you_embedding_matrix, max_indices)
    # Do anything you want with a normal Keras loss function.
    loss = some_keras_loss_function(target, embedding_vectors)
    loss = tf.reduce_mean(loss)
    return loss
Fan Luo's answer points in the right direction, but ultimately will not work because it involves non-differentiable operations. Note that such operations are acceptable on the true value (a loss function takes a true value and a predicted value; non-differentiable operations are only fine on the true value).
To be fair, that was what I was asking in the first place. It is not possible to do exactly what I wanted, but we can get a similar and differentiable behaviour (a consolidated sketch follows after the steps below):
1) Element-wise power of the softmax values. This makes smaller values much smaller. For example, with a power of 4 [0.5, 0.2, 0.7] becomes [0.0625, 0.0016, 0.2400]. Note that 0.2 is comparable to 0.7, but 0.0016 is negligible with respect to 0.24. The higher my_power is, the more similar to a one-hot the final result will be.
soft_extreme = Lambda(lambda x: x ** my_power)(softmax)
2) Importantly, both softmax and one-hot vectors are normalized, but not our "soft_extreme". First, find the sum of the array:
norm = tf.reduce_sum(soft_extreme, 1)
3) Normalize soft_extreme:
almost_one_hot = Lambda(lambda x: x / norm)(soft_extreme)
Note: Setting my_power too high in 1) will result in NaNs. If you need a better softmax to one-hot conversion, then you may do steps 1 to 3 two or more times in a row.
4) Finally we want the vector from the dictionary. Lookup is forbidden, but we can take the average vector using matrix multiplication. Because our soft_normalized is similar to one-hot encoding this average will be similar to the vector associated to the highest argument (original intended behaviour). The higher my_power is in (1), the truer this will be:
target_vectors = tf.tensordot(almost_one_hot, embedding_matrix, axes=[[1], [0]])
Note: This will not work directly using batches! In my case, I reshaped my "one hot" (from [batch, dictionary_length] to [batch, 1, dictionary_length]) using tf.reshape, then tiled my embedding_matrix batch times and finally used:
predicted_vectors = tf.matmul(reshaped_one_hot, tiled_embedding)
There may be more elegant solutions (or less memory-hungry, if tiling the embedding matrix is not an option), so feel free to explore more.
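Putting steps 1 to 4 together, a self-contained sketch might look like this (all shapes and the embedding matrix are placeholders; note that a plain matmul over the [batch, dictionary_length] matrix already handles the batch dimension, so the tiling above is avoided in this simplified form):

import tensorflow as tf

def soft_one_hot_embedding_loss(target_vectors, softmax, embedding_matrix, my_power=4):
    # 1) Element-wise power pushes the softmax towards a one-hot vector.
    soft_extreme = softmax ** my_power
    # 2)-3) Renormalize so each row sums to 1 again.
    almost_one_hot = soft_extreme / tf.reduce_sum(soft_extreme, axis=1, keepdims=True)
    # 4) Weighted average of the embedding rows; close to a plain lookup
    #    when almost_one_hot is close to a one-hot vector.
    predicted_vectors = tf.matmul(almost_one_hot, embedding_matrix)
    # Compare with the target embedding; MSE is just one possible choice here.
    return tf.reduce_mean(tf.square(target_vectors - predicted_vectors))

# Placeholder shapes: batch of 8, dictionary of 100 words, 16-dim embeddings.
softmax = tf.nn.softmax(tf.random.normal([8, 100]))
embedding_matrix = tf.random.normal([100, 16])
target_vectors = tf.random.normal([8, 16])
print(soft_one_hot_embedding_loss(target_vectors, softmax, embedding_matrix))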

How to softmax two types of labels in TensorFlow

TensorFlow is great and we have used it for image classification and recommendation systems. We used softmax and cross-entropy as the loss function. It works if we have only one type of label. For example, we choose only one digit from 0 to 9 in the MNIST dataset.
Now we have the features of gender and age. We have one-hot encoding for each example, such as [1, 0, 1, 0, 0, 0, 0]. The first two labels represent the gender and the last five labels represent the age. Each example has two 1s and the others should be 0s.
Now our code looks like this.
logits = inference(batch_features)
softmax = tf.nn.softmax(logits)
But I found that it applies the softmax across all the labels together (while each of my label vectors sums to 2). What I expect is that the first two probabilities sum up to 1 and the last five sum up to 1. I am not sure how to implement that in TensorFlow, because these 7 (2+5) outputs look the same to the softmax.
You have your gender and age logits concatenated together.
You want the marginal predictions.
You need to split your logits (tf.slice) into two arrays and softmax them separately.
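A sketch of that split-and-softmax setup (the batch and labels are placeholders; tf.split is used here instead of tf.slice for brevity):

import tensorflow as tf

# Placeholder logits for a batch of 4 examples: 2 gender + 5 age outputs.
logits = tf.random.normal([4, 7])
gender_logits, age_logits = tf.split(logits, [2, 5], axis=1)

# Each group is normalized separately, so each row of each group sums to 1.
gender_probs = tf.nn.softmax(gender_logits)
age_probs = tf.nn.softmax(age_logits)

# Placeholder one-hot targets, e.g. [1, 0 | 0, 0, 1, 0, 0] split the same way.
gender_labels = tf.one_hot([0, 1, 0, 1], depth=2)
age_labels = tf.one_hot([2, 4, 0, 3], depth=5)

# Total loss: the sum of the two cross-entropies.
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=gender_labels, logits=gender_logits)
    + tf.nn.softmax_cross_entropy_with_logits(labels=age_labels, logits=age_logits))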
Just remember that this only gives you the marginal probabilities.
It can't represent "an old man or a young woman", as this doesn't factorize.
So you might want to make joint predictions instead: 5x2 classes instead of 5+2 classes. Obviously this more powerful model is more prone to overfitting.
If you have a lot of classes in each category, you could build an intermediate model with a low-rank factorization of the joint matrix by adding together multiple marginal predictions. This gives Nxr + Mxr entries instead of N+M or NxM.
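For the joint version, a sketch with the same placeholder shapes: predict all 2x5 combinations with one softmax, and recover the marginals by summing out the other factor.

import tensorflow as tf

# 10 joint classes: (gender, age) pairs laid out as gender * 5 + age.
joint_logits = tf.random.normal([4, 2 * 5])
joint_probs = tf.nn.softmax(joint_logits)  # each row sums to 1 over all 10 pairs

# Marginals can still be read off by summing over the other factor.
probs_grid = tf.reshape(joint_probs, [-1, 2, 5])
gender_marginal = tf.reduce_sum(probs_grid, axis=2)  # shape [batch, 2]
age_marginal = tf.reduce_sum(probs_grid, axis=1)     # shape [batch, 5]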

Tensorflow Loss for Non-Independent Classes

I am using a Tensorflow network for classification between classes that are similar to their neighboring classes, i.e. not independent. For example, let's say we want to predict among 10 classes, but the predictions are not merely "correct" or "incorrect": if the correct class is 7 and the network predicts 6, the loss should be less than if the network predicted 5, because 6 is closer to the correct answer than 5. My understanding is that cross entropy with one-hot vectors provides an "all or nothing" loss rather than a "continuous" loss that reflects the magnitude of the error. If that is correct, how does one implement such a continuous loss in Tensorflow?
-- Update June 13 2016 ----
An example application might be color recognition. If the network predicts "green" but the true color is yellow-green, then the loss should be less than if the network predicted blue because green is a better prediction than blue.
You can choose to implement a continuous function (e.g. hue from HSV) as a single output, and construct your own loss calculation that reflects what you want to optimize. In that case you'd just have a single output value that ranged between 0.0 and 1.0, and the loss would be evaluated based on the distance from the labeled value.
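A rough sketch of that single-output idea (the architecture and the choice of loss are placeholders, not from the answer):

from tensorflow import keras

# One continuous output in [0.0, 1.0], e.g. hue, instead of 10 discrete classes.
model = keras.Sequential([
    keras.Input(shape=(32,)),                    # placeholder input size
    keras.layers.Dense(64, activation="relu"),   # placeholder hidden layer
    keras.layers.Dense(1, activation="sigmoid"),
])

# The loss is the distance from the labeled value, so predicting 0.6 when the
# target is 0.7 is penalized less than predicting 0.5.
model.compile(optimizer="adam", loss=keras.losses.MeanAbsoluteError())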