I want to ask a conceptual question about when to use one-hot encoding and when to use index to represent the labels in multi-class classification problems in tensorflow. I encountered the dimension problems about these, because I am not sure when to use which.
For example, in this fully connected NN example, one-hot encoding is proper. (https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/neural_network_raw.py)
But in this CNN example, index is proper. (https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/convolutional_network.py).
When I used one-hot encoding for the labels in the CNN example code, I got the error: "ValueError: Rank mismatch: Rank of labels (received 2) should equal rank of logits minus 1 (received 2)". But when I used index for labels, no problem.
Could someone explain when to use one-hot encoding and when to use index in tensorflow?
Related
I have been building a word-based neural machine translation model in Tensorflow using LSTMs. I have been following a couple of tutorials, including:
https://towardsdatascience.com/implementing-neural-machine-translation-using-keras-8312e4844eb8
My question is specifically about how the final Dense layer (with softmax activation) works.
All the words in the corpus are assigned to an integer. No word is assigned to the integer 0.
When you get your output from the final Dense (+ softmax) layer, what happens if index 0 has the maximum value? How does Tensorflow interpret this? No word in the target language has been assigned to the index 0. Yet this output needs to be fed as the input for the next time-step.
Could someone explain what's going on here?
According to my knowledge(please correct me if I'm wrong),
Multi-label classification(mutually inclusive) i.e., samples might have more than 1 correct values (for example movie genre, disease detection, etc).
Multi-Class classification(mutually exclusive) i.e., samples will always have 1 correct value (for example Cat or Dog, object detection, etc) this includes Binary Classification.
Assuming output is one-hot encoding.
What are the Loss function and metrics on has to use for these 2 types?
loss func. metrics
1. multi-label: (binary, categorical) (binary_accuracy, TopKCategorical accuracy, categorical_accuracy, AUC)
2. multi-class: (binary) (binary_accuracy,f1, recall, precision)
Please tell me from the above table which of them is/are more suitable, which of them is/are wrong & Why?
If you are trying to use multi-class classification provided that the labels (y) is one hot encoded, use the loss function as categorical crossentropy and use adam optimizer (It is suitable for most cases). Also, while using multi-class classification, the number of output nodes should be the same as the number of classes (or) labels. Say if your model is going to classify the input into 4 classes, You can configure the output layer as follows..
model.add(4, activation = "softmax")
Also, forgot to mention that softmax activation should be used in the output layer for multiclass classification problems.
Incase if your y is not one hot encoded, I would advise you to choose the loss function as sparse categorical crossentropy. No other changes will be necessary.
Also, I usually split the data into test data and train data and feed them to the model like this to get the accuracy in each epoch..
history = model.fit(train_data, validation_data = test_data, epochs = 10)
Hope it solved your problem.
I read from the documentation:
tf.keras.losses.SparseCategoricalCrossentropy(
from_logits=False, reduction="auto", name="sparse_categorical_crossentropy"
)
Computes the crossentropy loss between the labels and predictions.
Use this crossentropy loss function when there are two or more label
classes. We expect labels to be provided as integers. If you want to
provide labels using one-hot representation, please use
CategoricalCrossentropy loss. There should be # classes floating point
values per feature for y_pred and a single floating point value per
feature for y_true.
Why is this called sparse categorical cross entropy? If anything, we are providing a more compact encoding of class labels (integers vs one-hot vectors).
I think this is because integer encoding is more compact than one-hot encoding and thus more suitable for encoding sparse binary data. In other words, integer encoding = better encoding for sparse binary data.
This can be handy when you have many possible labels (and samples), in which case a one-hot encoding can be significantly more wasteful than a simple integer per example.
Why exactly it is called like that is probably best answered by Keras devs. However, note that this sparse cross-entropy is only suitable for "sparse labels", where exactly one value is 1 and all others are 0 (if the labels were represented as a vector and not just an index).
On the other hand, the general CategoricalCrossentropy also works with targets that are not one-hot, i.e. any probability distribution. The values just need to be between 0 and 1 and sum to 1. This tends to be forgotten because the use case of one-hot targets is so common in current ML applications.
I'm building a neural network with KERAS, where my labels are vectors, where exactly 6 values are 1, while all the other values (around 7000) are zero. I'm currently using categorical_crossentropy as my loss function but the documentation says:
Note: when using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all-zeros except for a 1 at the index corresponding to the class of the sample).
So what would be the "right" error function if categoreical_crossentropy is only the right way for one-hot encoded labels?
You can use sparse_categorical_crossentropy as loss, which accepts integer class indices instead of one-hot encoded ones.
From the docs it seems to me that it is using a embedding matrix to transform a one-hot encoding like sparse input vector to a dense vector. But how is this different from just using a fully connected layer?
Summarizing the answer from comments to here.
The main difference is efficiency. Instead of having to encode data points in these very long one hot vectors and do matrix multiplication, using embedding_column allows you to use index vectors and do a matrix lookup.
To represent categories.
Both one-hot encoding and embedding column are options to represent categorical features.
One of the problem with one-hot encoding is that it doesn't encode any relationships between the categories. They are completely independent from each other, so the neural network has no way of knowing which ones are similar to each other.
This problem can be solved by representing a categorical feature with an embedding
column. The idea is that each category has a smaller vector. The values are weights, similar to the weights that are used for basic features in a neural network.
For more:
https://developers.googleblog.com/2017/11/introducing-tensorflow-feature-columns.html