How to softmax two types of labels in TensorFlow - tensorflow

TensorFlow is great, and we have used it for image classification and recommendation systems. We use softmax with cross entropy as the loss function. This works when there is only one type of label, for example choosing a single digit from 0 to 9 in the MNIST dataset.
Now we have two kinds of labels, gender and age. Each example is encoded as two concatenated one-hot vectors, such as [1, 0, 1, 0, 0, 0, 0]. The first two entries represent the gender and the last five represent the age, so each example has exactly two 1s and the rest are 0s.
Our code currently looks like this:
logits = inference(batch_features)
softmax = tf.nn.softmax(logits)
But I found that this applies softmax across all seven outputs at once, so the whole vector sums to 1, while the label vector sums to 2. What I expect is that the first two outputs sum to 1 and the last five sum to 1. I'm not sure how to implement that in TensorFlow, because these 7 (2+5) outputs all look the same to it.

You have your gender and age logits concatenated together.
You want the marginal predictions.
You need to split your logits (tf.slice) into two arrays and softmax them separately.
Just remember that this only gives you the marginal probabilities.
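A minimal sketch of the split-then-softmax idea, written in NumPy so it runs without a TensorFlow session. The logit values and the helper names (softmax, split_softmax) are made up for illustration; in TensorFlow 1.0 or later the same split should be expressible as tf.split(logits, [2, 5], axis=1) followed by tf.nn.softmax on each piece.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

def split_softmax(logits, sizes=(2, 5)):
    # Cut the last axis into consecutive groups and normalize each group separately.
    boundaries = np.cumsum(sizes)[:-1]
    groups = np.split(logits, boundaries, axis=-1)
    return [softmax(g) for g in groups]

# One example with 2 gender logits followed by 5 age logits (made-up values).
logits = np.array([[2.0, 0.5, 0.1, 1.2, -0.3, 0.8, 0.0]])
gender_probs, age_probs = split_softmax(logits)
```

Each returned array sums to 1 along its last axis, so a separate cross-entropy loss can be computed per group and the two losses added.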
It can't represent "an old man or a young woman", as this doesn't factorize.
So you might want to make joint predictions instead: 5x2 classes instead of 5+2 classes. Obviously, this more powerful model is more prone to overfitting.
If you have a lot of classes in each category, you could build an intermediate model with a low-rank factorization of the joint matrix, by adding together multiple marginal predictions. This gives Nxr+Mxr entries instead of N+M or NxM.
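One possible reading of the low-rank idea, sketched in NumPy: the N x M table of joint scores is built as a sum of r outer products, so only Nxr + Mxr numbers have to be predicted. The sizes, the random factors standing in for network outputs, and the single softmax over all cells are assumptions made for illustration, not something the answer specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols, rank = 50, 40, 3   # illustrative N, M and r

# In a real model these factors would come from the network; random values stand in.
row_factors = rng.normal(size=(n_rows, rank))   # N x r entries
col_factors = rng.normal(size=(n_cols, rank))   # M x r entries
n_predicted = row_factors.size + col_factors.size

# Sum of `rank` outer products gives a rank-r table of joint scores (N x M).
joint_scores = row_factors @ col_factors.T

# A single softmax over all N*M cells turns the scores into a joint distribution.
flat = joint_scores.ravel()
flat = np.exp(flat - flat.max())
joint_probs = (flat / flat.sum()).reshape(n_rows, n_cols)

# Marginals are recovered by summing out the other variable.
row_marginal = joint_probs.sum(axis=1)
col_marginal = joint_probs.sum(axis=0)
```

With these sizes the model predicts 270 numbers instead of the 2000 a full joint table would need, while still being able to express some dependence between the two label types.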

Related

Keras Conv3D Layer with Discrete Values

I'm trying to build a model that will learn features of a 3D space. Unlike image processing, the values of the 3D matrix are not continuous; they represent some discrete value of what "material" can be found at that specific coordinate (grass with value 1 or stairs with value 2 for example).
Is it possible to train a model to learn the features of the space without interpolating in-between values? For example, I don't want the neural net to deduce 1.5 to be some kind of grass stairs.
You'll want to use one-hot encoding, which represents categorical values as arrays of zeroes with a single value set to one. This means that grass (id = 1) would be [0, 1, 0, 0, ...] and stairs (id = 2) would be [0, 0, 1, 0, ...]. To perform one-hot encoding, look into keras' to_categorical function.
Further reading:
one-hot encoding tutorial
one-hot preprocessing using to_categorical
one-hot on the fly using an embedding layer
As with any categorical model, the data should be one-hot encoded.
The "channels" dimension of your data should have a size of n-materials.
A value of 0 means that material is absent
A value of 1 means that material is present
So, your input shape will be something like (samples, spatial1, spatial2, spatial3, materials). If your data is currently shaped as (samples, s1, s2, s3) and has the materials as integers as you described, you can use to_categorical to transform the integers to "one-hot".
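To make the shape change concrete, here is a small NumPy sketch of the encoding step. It indexes an identity matrix instead of calling Keras' to_categorical, which is a stand-in chosen so the example runs without Keras; the grid values, the number of materials, and the helper name one_hot_grid are made up.

```python
import numpy as np

def one_hot_grid(material_ids, n_materials):
    # Indexing an identity matrix with integer ids picks the one-hot row for each id.
    return np.eye(n_materials, dtype=np.float32)[material_ids]

# A tiny 2x2x2 space with three hypothetical materials: 0 = air, 1 = grass, 2 = stairs.
grid = np.array([[[0, 1], [2, 1]],
                 [[1, 1], [0, 2]]])
batch = grid[np.newaxis, ...]                  # shape (samples, s1, s2, s3)
encoded = one_hot_grid(batch, n_materials=3)   # shape (samples, s1, s2, s3, materials)
```

The last axis now plays the role of the channels dimension for a Conv3D layer, and np.argmax over that axis recovers the original integer ids.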
Although I am not sure if this is what you are asking for, I would imagine that after the bottleneck of the convolutional network, one would typically use a flatten layer whose output then goes to a dense layer. The output layer, if using a sigmoid activation, will give you probabilities for each of the classes, which have to be one-hot encoded, as others have suggested.
If you want the output of the network itself to be in discrete values, I suppose you can use some sort of step-wise activation function in the output layer. However, you have to take care that your loss remains differentiable throughout the network (which is why such activation functions are not available in Keras). This might be of interest: https://github.com/keras-team/keras/issues/7370

Custom loss in Keras with softmax to one-hot

I have a model that outputs a Softmax, and I would like to develop a custom loss function. The desired behaviour would be:
1) Softmax to one-hot (normally I do numpy.argmax(softmax_vector) and set that index to 1 in a null vector, but this is not allowed in a loss function).
2) Multiply the resulting one-hot vector by my embedding matrix to get an embedding vector (in my context: the word-vector that is associated to a given word, where words have been tokenized and assigned to indices, or classes for the Softmax output).
3) Compare this vector with the target (this could be a normal Keras loss function).
I know how to write a custom loss function in general, but not to do this. I found this closely related question (unanswered), but my case is a bit different, since I would like to preserve my softmax output.
It is possible to mix TensorFlow and Keras in your custom loss function. Once you have access to all the TensorFlow functions, things become very easy. Here is an example of how this function could be implemented.
import tensorflow as tf

def custom_loss(target, softmax):
    # Index of the most probable class for each example
    max_indices = tf.argmax(softmax, -1)
    # Get the embedding vectors. In TensorFlow, this can be done directly
    # with tf.nn.embedding_lookup
    # (you_embedding_matrix is a placeholder for your own embedding matrix)
    embedding_vectors = tf.nn.embedding_lookup(you_embedding_matrix, max_indices)
    # Do anything you want with a normal Keras loss function
    # (some_keras_loss_function is a placeholder as well)
    loss = some_keras_loss_function(target, embedding_vectors)
    loss = tf.reduce_mean(loss)
    return loss
Fan Luo's answer points in the right direction, but ultimately will not work because it involves non-differentiable operations (argmax and the lookup). Note that such operations are acceptable on the true value: a loss function takes a true value and a predicted value, and non-differentiable operations are only fine on the true value.
To be fair, that was what I was asking for in the first place. It is not possible to do exactly what I wanted, but we can get a similar, differentiable behaviour:
1) Element-wise power of the softmax values. This makes smaller values much smaller. For example, with a power of 4, [0.5, 0.2, 0.7] becomes [0.0625, 0.0016, 0.2401]. Note that 0.2 is comparable to 0.7, but 0.0016 is negligible with respect to 0.2401. The higher my_power is, the more similar to a one-hot the final result will be.
soft_extreme = Lambda(lambda x: x ** my_power)(softmax)
2) Importantly, both softmax and one-hot vectors are normalized, but not our "soft_extreme". First, find the sum of the array:
norm = tf.reduce_sum(soft_extreme, axis=1, keepdims=True)  # keep the axis so the division in 3) broadcasts per row
3) Normalize soft_extreme:
almost_one_hot = Lambda(lambda x: x / norm)(soft_extreme)
Note: Setting my_power too high in 1) will result in NaNs. If you need a better softmax to one-hot conversion, then you may do steps 1 to 3 two or more times in a row.
4) Finally, we want the vector from the dictionary. A lookup is not allowed here because it is not differentiable, but we can take a weighted average of the vectors using matrix multiplication. Because almost_one_hot is similar to a one-hot encoding, this average will be similar to the vector associated with the highest argument (the originally intended behaviour). The higher my_power is in 1), the truer this will be:
target_vectors = tf.tensordot(almost_one_hot, embedding_matrix, axes=[[1], [0]])
Note: This will not work directly using batches! In my case, I reshaped my "one hot" from [batch, dictionary_length] to [batch, 1, dictionary_length] using tf.reshape, then tiled my embedding_matrix batch times, and finally used:
predicted_vectors = tf.matmul(reshaped_one_hot, tiled_embedding)
There may be more elegant solutions (or less memory-hungry, if tiling the embedding matrix is not an option), so feel free to explore more.
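Steps 1) to 4) can be checked numerically with a small NumPy sketch. The softmax outputs, the toy embedding matrix, and the helper name sharpen are made up for illustration; this only shows the forward computation and says nothing about how gradients behave in Keras.

```python
import numpy as np

def sharpen(probs, power):
    # Steps 1-3: element-wise power, then renormalize each row to sum to 1.
    powered = probs ** power
    return powered / powered.sum(axis=-1, keepdims=True)

# Made-up softmax outputs for a batch of 2 examples over a 3-word dictionary.
softmax_out = np.array([[0.2, 0.1, 0.7],
                        [0.6, 0.3, 0.1]])
# Made-up embedding matrix: one 2-dimensional vector per word.
embedding_matrix = np.array([[1.0, 0.0],
                             [0.0, 1.0],
                             [2.0, 2.0]])

almost_one_hot = sharpen(softmax_out, power=4)
# Step 4: weighted average of the embedding rows via matrix multiplication.
predicted_vectors = almost_one_hot @ embedding_matrix
# What a hard argmax lookup would have returned, for comparison.
hard_lookup = embedding_matrix[np.argmax(softmax_out, axis=-1)]
```

In this NumPy sketch the plain matrix product already handles the batch axis; whether the same holds in a given Keras/TensorFlow setup depends on the versions and tensor shapes involved. The averaged vectors land close to the hard lookup, and a larger power moves them closer still.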

Multilabel/ Multitask/ Multiclass Regression in machine learning

My challenge is to train a neural network to recognize certain actions and events for different classes of task or how you want to call it given the input.
I see that most of the input/output when training neural networks is either 0 or 1 or [0,1]. But in my scenario I want my input to be in the form of integers which are arbitrarily big and the same form is expected for the output.
Let me give you an example:
Input
X = [ 23, 4, 0, 1233423, 1, 0, 0] ->
Y = [ 2, 1, 1]
Now each element in X[i] represent different properties of the same entity.
Let's say I want to describe a human being:
23 -> maps to a place he/she was born
4 -> maps to a school they graduated
etc.
Each entry in Y[i], on the other hand, means what is more likely the human to do in 3 different categories ( as len(Y) is 3 in this case ):
Y[0] = 2 -> maps to eating icecream ( from a variety of other choices )
Y[1] = 1 -> maps to a time of day moment ( morning, noon, afternoon, evening, etc...)
Y[2] = 1 -> maps to a day of the week for example
Now of course, if the task were just a multi-label problem, I would apply a sigmoid on the output layer and use binary_crossentropy as the loss function, but that does not work here because my output is obviously not in [0, 1].
Also, I am not really sure what loss function to apply, since I want all classes/subclasses in Y to be correctly predicted. What I am basically saying is that each Y[i] is a classification problem of its own.
It would be more accurate if my output was in the shape of (3, labels_per_class)
and the loss function would calculate a loss for each of the 3 different classes
trying to optimize the result in such a way that each of the 3 classes would have the correct labels.
I am not sure if that is possible or how at least.
I am really still in the beginnings with my neural network knowledge and learning so clearly I am struggling with this problem.
But really to put it more simply I have a better idea how to describe it. It is more or less like an auto-encoder but the inputs and outputs are integers. The difference is that in my case the output has a different size from the input where in the auto-encoder they are the same.
My solution was to apply a relu at the output layer, ( and of course relu-like activations on all other layers as well ) and binary_crossentropy as the loss functions but the accuracy of the network is very low, around 15%.
For a standard classification you would probably do a dense layer with a number of nodes equal to the number of classes then apply softmax. The loss would be tf.losses.softmax_cross_entropy. You would do a sigmoid if you want to allow multiple classes, not just one.
Now you have multiple classification tasks. One way to do it is to take the last hidden layer (the one before the one where you do softmax). For each task do a dense layer with a number of nodes equals to the number of classes for that task and apply softmax. To compute the loss just add the losses together.
If the tasks are too different you may want to have more than one layer for each prediction.
You can also put some weights on the different losses if, say, eating ice-cream is a lot more important than getting the time of day right.
Only use relu if the prediction space is continuous. Say, time of day is continuous, but the choice between eating ice-cream, going to work, and watching TV is not. If you use relu, use a loss like L1 (tf.losses.absolute_difference) or L2 (tf.losses.mean_squared_error).
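The multi-head setup described above can be sketched in NumPy as follows: one shared hidden layer, one dense head plus softmax cross-entropy per task, and a weighted sum of the per-task losses. All sizes, weights, targets, and helper names here are hypothetical, and random values stand in for a trained network.

```python
import numpy as np

def softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

def sparse_cross_entropy(logits, labels):
    # Negative log-probability assigned to the correct integer label of each example.
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(labels)), labels])

rng = np.random.default_rng(0)
batch, hidden_dim = 4, 8
classes_per_task = [6, 4, 7]     # hypothetical: activity, time of day, day of week
task_weights = [2.0, 1.0, 1.0]   # hypothetical: the first task matters most

hidden = rng.normal(size=(batch, hidden_dim))   # output of the last shared hidden layer
heads = [(0.1 * rng.normal(size=(hidden_dim, k)), np.zeros(k)) for k in classes_per_task]
targets = np.array([[2, 1, 1], [0, 3, 6], [5, 0, 2], [1, 2, 4]])  # one integer label per task

task_losses = []
for task, (weights, bias) in enumerate(heads):
    logits = hidden @ weights + bias            # one dense head per task
    task_losses.append(sparse_cross_entropy(logits, targets[:, task]).mean())

total_loss = sum(w * loss for w, loss in zip(task_weights, task_losses))
```

Because each head has its own softmax, every Y[i] is treated as a separate classification problem while the earlier layers are shared.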

What are the differences between all these cross-entropy losses in Keras and TensorFlow?

What are the differences between all these cross-entropy losses?
Keras is talking about
Binary cross-entropy
Categorical cross-entropy
Sparse categorical cross-entropy
While TensorFlow has
Softmax cross-entropy with logits
Sparse softmax cross-entropy with logits
Sigmoid cross-entropy with logits
What are the differences and relationships between them? What are the typical applications for them? What's the mathematical background? Are there other cross-entropy types that one should know? Are there any cross-entropy types without logits?
There is just one cross (Shannon) entropy defined as:
H(P||Q) = - SUM_i P(X=i) log Q(X=i)
In machine learning usage, P is the actual (ground truth) distribution, and Q is the predicted distribution. All the functions you listed are just helper functions which accept different ways to represent P and Q.
There are basically 3 main things to consider:
There are either 2 possible outcomes (binary classification) or more. If there are just two outcomes, then Q(X=1) = 1 - Q(X=0), so a single float in (0,1) identifies the whole distribution; this is why a neural network for binary classification has a single output (and so does logistic regression). If there are K>2 possible outcomes, one has to define K outputs (one for each Q(X=...)).
One either produces proper probabilities (meaning that Q(X=i)>=0 and SUM_i Q(X=i) = 1), or one just produces a "score" and has some fixed method of transforming the score into a probability. For example, a single real number can be "transformed to a probability" by taking its sigmoid, and a set of real numbers can be transformed by taking their softmax, and so on.
There is a j such that P(X=j)=1 (there is one "true class", targets are "hard", like "this image represents a cat"), or there are "soft targets" (like "we are 60% sure this is a cat, but with 40% it is actually a dog").
Depending on these three aspects, different helper function should be used:
                                outcomes   what is in Q   targets in P
-------------------------------------------------------------------------------
binary CE                       2          probability    any
categorical CE                  >2         probability    soft
sparse categorical CE           >2         probability    hard
sigmoid CE with logits          2          score          any
softmax CE with logits          >2         score          soft
sparse softmax CE with logits   >2         score          hard
In the end one could just use "categorical cross entropy", as this is how it is mathematically defined, however since things like hard targets or binary classification are very popular - modern ML libraries do provide these additional helper functions to make things simpler. In particular "stacking" sigmoid and cross entropy might be numerically unstable, but if one knows these two operations are applied together - there is a numerically stable version of them combined (which is implemented in TF).
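To illustrate the stability remark, here is a NumPy sketch comparing a naive "sigmoid, then cross entropy" with the rearranged form max(x, 0) - x*z + log(1 + exp(-|x|)), which is the formulation given in the TensorFlow documentation for sigmoid_cross_entropy_with_logits. The function names and test values are made up for this example.

```python
import numpy as np

def naive_sigmoid_ce(logits, targets):
    # Apply the sigmoid first, then plug the probabilities into the cross-entropy formula.
    probs = 1.0 / (1.0 + np.exp(-logits))
    return -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))

def stable_sigmoid_ce(logits, targets):
    # Algebraically the same loss, rearranged so exp() never sees a large positive argument.
    return np.maximum(logits, 0.0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))

logits = np.array([-3.0, -0.5, 0.0, 2.0])
targets = np.array([0.0, 1.0, 1.0, 0.3])   # soft targets are allowed as well
naive = naive_sigmoid_ce(logits, targets)
stable = stable_sigmoid_ce(logits, targets)

# For extreme scores the stable form stays finite, where the naive form would hit log(0).
extreme = stable_sigmoid_ce(np.array([1000.0, -1000.0]), np.array([0.0, 1.0]))
```

For moderate scores both versions agree; the difference only shows up when the scores are large in magnitude.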
It is important to notice that if you apply the wrong helper function, the code will usually still execute, but the results will be wrong. For example, if you apply a softmax_* helper to binary classification with one output, your network will be considered to always produce "True" at the output.
As a final note - this answer considers classification, it is slightly different when you consider multi label case (when a single point can have multiple labels), as then Ps do not sum to 1, and one should use sigmoid_cross_entropy_with_logits despite having multiple output units.
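The relationships in the table can be verified with a few lines of NumPy. The functions below are hand-written stand-ins named after the table rows, not the library implementations, and the scores and labels are made up.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

def categorical_ce(p, q):
    # H(P||Q) = -SUM_i P(X=i) log Q(X=i); P may be one-hot (hard) or soft.
    return -np.sum(p * np.log(q), axis=-1)

def sparse_categorical_ce(labels, q):
    # Hard targets given as integer class ids: only the true class contributes.
    return -np.log(q[np.arange(len(labels)), labels])

def softmax_ce_with_logits(p, scores):
    # Takes scores instead of probabilities and applies a log-softmax internally.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    log_q = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -np.sum(p * log_q, axis=-1)

scores = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 2])
one_hot = np.eye(3)[labels]
q = softmax(scores)

loss_categorical = categorical_ce(one_hot, q)
loss_sparse = sparse_categorical_ce(labels, q)
loss_with_logits = softmax_ce_with_logits(one_hot, scores)
```

All three calls return the same per-example losses, since they compute the same cross entropy from different representations of P and Q.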
Logits
For this purpose, "logits" can be seen as the non-activated outputs of the model.
By default, Keras losses take an "activated" output (you apply "sigmoid" or "softmax" before the loss).
The TensorFlow functions listed above take "logits", that is, "non-activated" outputs (you should not apply "sigmoid" or "softmax" before the loss).
Losses "with logits" will apply the activation internally.
Some functions allow you to choose from_logits=True or from_logits=False, which tells the function whether or not to apply the activation itself.
Sparse
Sparse functions use the target data (ground truth) as "integer labels": 0, 1, 2, 3, 4.....
Non-sparse functions use the target data as "one-hot labels": [1,0,0], [0,1,0], [0,0,1]
Binary crossentropy = Sigmoid crossentropy
Problem type:
single class (false/true); or
non-exclusive multiclass (many classes may be correct)
Model output shape: (batch, ..., >=1)
Activation: "sigmoid"
Categorical crossentropy = Softmax crossentropy
Problem type: exclusive classes (only one class may be correct)
Model output shape: (batch, ..., >=2)
Activation: "softmax"

How can I determine several labels in parallel (in a neural network) by using a softmax-output-layer in tensorflow?

As part of a project for my master's degree, I am implementing a neural network using the TensorFlow library from Google. I would like to determine several labels in parallel at the output layer of my feed-forward neural network, and I want to use the softmax function as the activation function of the output layer.
So what I specifically want as output is a vector that looks like this:
vec = [0.1, 0.8, 0.1, 0.3, 0.2, 0.5]
Here the first three numbers are the probabilities of the three classes of the first classification and the other three numbers are the probabilities of the three classes of the second classification. So in this case I would say that the labels are:
[ class2 , class3 ]
In a first attempt, I tried to implement this by first reshaping the (1x6) vector to a (2x3) matrix with tf.reshape(), then applying the softmax function to the matrix with tf.nn.softmax(), and finally reshaping the matrix back to a vector. Unfortunately, due to the reshaping, the Gradient-Descent-Optimizer has problems calculating the gradient, so I tried something different.
What I do now is take the (1x6) vector and multiply it by a matrix that has a (3x3) identity matrix in the upper part and a (3x3) zero matrix in the lower part. With this I extract the first three entries of the vector. Then I can apply the softmax function and bring the result back into the old (1x6) form by another matrix multiplication. This has to be repeated for the other three vector entries as well.
outputSoftmax = tf.nn.softmax( vec * [[1,0,0],[0,1,0],[0,0,1],[0,0,0],[0,0,0],[0,0,0]] ) * tf.transpose( [[1,0,0],[0,1,0],[0,0,1],[0,0,0],[0,0,0],[0,0,0]] )
+ tf.nn.softmax( vec * [[0,0,0],[0,0,0],[0,0,0],[1,0,0],[0,1,0],[0,0,1]] ) * tf.transpose( [[0,0,0],[0,0,0],[0,0,0],[1,0,0],[0,1,0],[0,0,1]] )
It works so far, but I don't like this solution.
In my real problem, I have to determine not just two labels at a time but 91, so I would have to repeat the procedure from above 91 times.
Does anyone have a solution for obtaining the desired vector, where the softmax function is applied to only three entries at a time, without writing the "same" code 91 times?
You could apply the tf.split function to obtain 91 tensors (one for each classification), then apply softmax to each of them.
# Argument order for TensorFlow >= 1.0: tf.split(value, num_or_size_splits, axis)
classes_split = tf.split(all_in_one, 91, axis=0)
for c in classes_split:
    softmax_class = tf.nn.softmax(c)
    # use softmax_class to compute some loss, add it to overall loss
Or, instead of computing the loss directly, you could also concatenate them together again:
# Argument order for TensorFlow >= 1.0: tf.split(value, num_or_size_splits, axis)
classes_split = tf.split(all_in_one, 91, axis=0)
# softmax each split individually
classes_split_softmaxed = [tf.nn.softmax(c) for c in classes_split]
# Concatenate again (TensorFlow >= 1.0: tf.concat(values, axis))
all_in_one_softmaxed = tf.concat(classes_split_softmaxed, axis=0)
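As an alternative to a Python loop over 91 pieces, the grouping can also be expressed with a single reshape. The NumPy sketch below shows only the forward computation on the vector from the question; the helper name grouped_softmax is made up, and it does not address the gradient problem the question reports with tf.reshape.

```python
import numpy as np

def grouped_softmax(flat_scores, n_groups, group_size):
    # View the flat output as (batch, groups, classes) and normalize over the last axis.
    grouped = flat_scores.reshape(-1, n_groups, group_size)
    shifted = grouped - grouped.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    probs = exps / exps.sum(axis=-1, keepdims=True)
    # Flatten back to the original (batch, groups * classes) layout.
    return probs.reshape(-1, n_groups * group_size)

# The vector from the question, treated as pre-softmax scores for 2 groups of 3 classes.
vec = np.array([[0.1, 0.8, 0.1, 0.3, 0.2, 0.5]])
out = grouped_softmax(vec, n_groups=2, group_size=3)
```

Each block of three entries in out sums to 1, and the same call with n_groups=91 would cover the full problem without repeating code.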