I'm building a neural network with Keras in which my labels are vectors where exactly 6 values are 1 and all the other values (around 7000) are 0. I'm currently using categorical_crossentropy as my loss function, but the documentation says:
Note: when using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all-zeros except for a 1 at the index corresponding to the class of the sample).
So what would be the "right" loss function if categorical_crossentropy is only appropriate for one-hot encoded labels?

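You can use sparse_categorical_crossentropy as the loss, which accepts integer class indices instead of one-hot encoded vectors. Note, however, that it assumes exactly one class per sample; for labels with several 1s per vector, as in the question, binary_crossentropy with a sigmoid output layer is the usual choice.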
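For labels like those in the question (several 1s per vector), one common option is a per-class binary cross-entropy on sigmoid outputs. The following is only a NumPy sketch of that computation, not Keras's implementation; the 8-class toy target stands in for the 6-of-7000 case, and the function name and the eps clipping value are assumptions made for illustration.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean per-class binary cross-entropy for multi-hot targets (illustrative)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    per_class = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return per_class.mean(axis=-1)  # one loss value per sample

# Toy multi-hot target: 8 classes, 2 of them active.
y_true = np.array([[0, 1, 0, 0, 1, 0, 0, 0]], dtype=float)
good = np.where(y_true == 1, 0.9, 0.1)  # confident and correct predictions
bad = np.where(y_true == 1, 0.1, 0.9)   # confident and wrong predictions

loss_good = binary_crossentropy(y_true, good)
loss_bad = binary_crossentropy(y_true, bad)
print(loss_good, loss_bad)  # the correct predictions get the smaller loss
```

Each class is scored independently here, so any number of 1s per label vector is allowed.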


Which Loss function & Metrics is more suitable for multi-label classification? Binary or Categorical cross-entropy and Why?

According to my knowledge (please correct me if I'm wrong):
Multi-label classification (mutually inclusive): samples may have more than one correct value (for example movie genre, disease detection, etc.).
Multi-class classification (mutually exclusive): samples always have exactly one correct value (for example cat or dog, object detection, etc.); this includes binary classification.
Assume the output is one-hot encoded.
What loss function and metrics should one use for these two types?
                 loss func.              metrics
1. multi-label:  (binary, categorical)   (binary_accuracy, TopKCategorical accuracy, categorical_accuracy, AUC)
2. multi-class:  (binary)                (binary_accuracy, f1, recall, precision)
Please tell me from the above table which of them is/are more suitable, which of them is/are wrong & Why?
For multi-class classification with one-hot encoded labels (y), use categorical crossentropy as the loss function and the Adam optimizer (it is suitable for most cases). Also, the number of output nodes should equal the number of classes (labels). For example, if your model is going to classify the input into 4 classes, you can configure the output layer as follows:
model.add(Dense(4, activation="softmax"))  # Dense comes from keras.layers
Also, forgot to mention that softmax activation should be used in the output layer for multiclass classification problems.
If your y is not one-hot encoded (i.e. it holds integer class indices), I would advise choosing sparse categorical crossentropy as the loss function. No other changes will be necessary.
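To illustrate why nothing else needs to change, here is a minimal NumPy sketch (not the Keras implementation) showing that the two losses yield the same value when the integer labels and the one-hot labels describe the same classes. The function names and the toy probabilities are assumptions made for illustration.

```python
import numpy as np

def categorical_crossentropy(y_onehot, probs):
    # -sum_k y_k * log(p_k), one value per sample
    return -(y_onehot * np.log(probs)).sum(axis=1)

def sparse_categorical_crossentropy(y_index, probs):
    # -log(p[true class]); the true-class probability is picked by index
    return -np.log(probs[np.arange(len(y_index)), y_index])

# Toy softmax outputs for 2 samples and 4 classes.
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.2, 0.5, 0.1]])
y_index = np.array([0, 2])     # integer labels
y_onehot = np.eye(4)[y_index]  # the same labels, one-hot encoded

loss_dense = categorical_crossentropy(y_onehot, probs)
loss_sparse = sparse_categorical_crossentropy(y_index, probs)
print(np.allclose(loss_dense, loss_sparse))
```

Only the label format differs; the model and its softmax output stay the same.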
Also, I usually split the data into training data and test data and feed them to the model like this to get the accuracy in each epoch:
history = model.fit(train_data, validation_data=test_data, epochs=10)
Hope it solved your problem.

Meaning of sparse in "sparse cross entropy loss"?

I read from the documentation:
tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False, reduction="auto", name="sparse_categorical_crossentropy")
Computes the crossentropy loss between the labels and predictions.
Use this crossentropy loss function when there are two or more label
classes. We expect labels to be provided as integers. If you want to
provide labels using one-hot representation, please use
CategoricalCrossentropy loss. There should be # classes floating point
values per feature for y_pred and a single floating point value per
feature for y_true.
Why is this called sparse categorical cross entropy? If anything, we are providing a more compact encoding of class labels (integers vs one-hot vectors).
I think this is because integer encoding is more compact than one-hot encoding and is therefore the better way to encode sparse binary data, i.e. vectors that are almost entirely zeros.
This can be handy when you have many possible labels (and samples), in which case a one-hot encoding can be significantly more wasteful than a simple integer per example.
Why exactly it is named that way is probably best answered by the Keras developers. However, note that this sparse cross-entropy is only suitable for "sparse labels", where exactly one value is 1 and all others are 0 (if the labels were represented as a vector and not just an index).
On the other hand, the general CategoricalCrossentropy also works with targets that are not one-hot, i.e. any probability distribution. The values just need to be between 0 and 1 and sum to 1. This tends to be forgotten because the use case of one-hot targets is so common in current ML applications.
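As a rough illustration of that last point, the NumPy sketch below evaluates the same cross-entropy formula on a one-hot target and on a "soft" target that is a probability distribution. It is not the Keras implementation, and the numbers are made up for the example.

```python
import numpy as np

def categorical_crossentropy(target, probs):
    # -sum_k target_k * log(p_k); valid for any target distribution
    return -(target * np.log(probs)).sum(axis=-1)

probs = np.array([0.6, 0.3, 0.1])    # model output (e.g. after softmax)
hard = np.array([1.0, 0.0, 0.0])     # one-hot target
soft = np.array([0.8, 0.15, 0.05])   # non-one-hot target; entries sum to 1

loss_hard = categorical_crossentropy(hard, probs)
loss_soft = categorical_crossentropy(soft, probs)
print(loss_hard, loss_soft)
```

With the one-hot target the sum collapses to a single term, -log(0.6), which is the only case an index-based (sparse) loss can express.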

Keras Conv3D Layer with Discrete Values

I'm trying to build a model that will learn features of a 3D space. Unlike image processing, the values of the 3D matrix are not continuous; they represent some discrete value of what "material" can be found at that specific coordinate (grass with value 1 or stairs with value 2 for example).
Is it possible to train a model to learn the features of the space without interpolating in-between values? For example, I don't want the neural net to deduce 1.5 to be some kind of grass stairs.
You'll want to use one-hot encoding, which represents categorical values as arrays of zeroes with a single value set to one. This means that grass (id = 1) would be [0, 1, 0, 0, ...] and stairs (id = 2) would be [0, 0, 1, 0, ...]. To perform one-hot encoding, look into keras' to_categorical function.
As with any categorical model, this data should be one-hot encoded.
The "channels" dimension of your data should have a size of n-materials.
A value of 0 means that material is not present.
A value of 1 means that material is present.
So, your input shape will be something like (samples, spatial1, spatial2, spatial3, materials). If your data is currently shaped as (samples, s1, s2, s3) and has the materials as integers as you described, you can use to_categorical to transform the integers to one-hot.
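The shape change described above can be sketched with plain NumPy. This is only an illustration of the resulting layout: the material ids, the grid size and the variable names are hypothetical, and Keras's to_categorical is expected to produce the same trailing-axis layout.

```python
import numpy as np

n_materials = 3  # hypothetical ids: 0 = air, 1 = grass, 2 = stairs

# Integer material ids, shaped (samples, s1, s2, s3).
rng = np.random.default_rng(0)
grid = rng.integers(0, n_materials, size=(2, 4, 4, 4))

# Indexing the identity matrix by id yields one one-hot row per voxel,
# appended as a trailing "channels" axis.
one_hot = np.eye(n_materials, dtype=np.float32)[grid]

print(grid.shape, one_hot.shape)  # the second shape has one extra axis of size n_materials
```

Because every voxel becomes a vector with a single 1, there is no numeric "in-between" for the network to interpolate along, unlike raw ids such as 1 and 2.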
Although I am not sure if this is what you are asking for, I would imagine that after the bottleneck of the convolutional network, one would typically use a flatten layer whose output then goes to a dense layer. The output layer, if using sigmoid activation, will give you probabilities for each of the classes, which have to be one-hot encoded, as others have suggested.
If you want the output of the network itself to be discrete values, I suppose you can use some sort of step-wise activation function in the output layer. However, you have to take care that your loss remains differentiable throughout the network (which is why such activation functions are not available in Keras).

TensorFlow Output: One-Hot Encoding vs. Index

I want to ask a conceptual question about when to use one-hot encoding and when to use an index to represent the labels in multi-class classification problems in TensorFlow. I have run into dimension problems with these because I am not sure when to use which.
For example, in this fully connected NN example, one-hot encoding is proper.
But in this CNN example, an index is proper.
When I used one-hot encoding for the labels in the CNN example code, I got the error: "ValueError: Rank mismatch: Rank of labels (received 2) should equal rank of logits minus 1 (received 2)". But when I used index for labels, no problem.
Could someone explain when to use one-hot encoding and when to use index in tensorflow?
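The error message quoted above is what one would expect from a loss that takes index labels, since it asks for labels with one dimension fewer than the logits. The NumPy sketch below only illustrates the ranks involved and how one-hot labels can be converted back to indices; the sizes are made up, and which loss each of the two examples uses is an assumption, not something stated in the question.

```python
import numpy as np

n_classes = 10
y_index = np.array([3, 0, 7])          # index labels: shape (3,), rank 1
y_onehot = np.eye(n_classes)[y_index]  # one-hot labels: shape (3, 10), rank 2
logits = np.zeros((3, n_classes))      # model outputs: shape (3, 10), rank 2

# An index-based ("sparse") loss wants labels of rank (logits rank - 1);
# a one-hot based loss wants labels with the same shape as the logits.
print(y_index.ndim, y_onehot.ndim, logits.ndim)

# One-hot labels can be turned back into indices with argmax.
recovered = y_onehot.argmax(axis=1)
```

So the choice follows the loss function: if the loss is a sparse variant, feed indices; otherwise feed one-hot vectors.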

MultiClass Keras Classifier prediction output meaning

I have a Keras classifier built using the Keras wrapper of the Scikit-Learn API. The neural network has 10 output nodes, and the training data is all represented using one-hot encoding.
According to Tensorflow documentation, the predict function outputs a shape of (n_samples,). When I fitted 514541 samples, the function returned an array with shape (514541, ), and each entry of the array ranged from 0 to 9.
Since I have ten different outputs, does the numerical value of each entry correspond exactly to the result that I encoded in my training matrix?
i.e. if index 5 of my one-hot encoding of y_train represents "orange", does a prediction value of 5 mean that the neural network predicted "orange"?
Here is a sample of my model:
model = Sequential()
model.add(Dropout(0.2, input_shape=(32,) ))
model.add(Dense(21, activation='selu'))
model.add(Dense(10, activation='softmax'))
There are some issues with your question.
The neural network has 10 output nodes, and the training data is all represented using one-hot encoding.
Since your network has 10 output nodes and your labels are one-hot encoded, your model's output should also be 10-dimensional, following the same one-hot layout, i.e. of shape (n_samples, 10). Moreover, since you use a softmax activation for your final layer, each element of your 10-dimensional output should be in [0, 1] and is interpreted as the probability of the output belonging to the respective (one-hot encoded) class.
According to Tensorflow documentation, the predict function outputs a shape of (n_samples,).
It's puzzling why you refer to Tensorflow, while your model is clearly a Keras one; you should refer to the predict method of the Keras sequential API.
When I fitted 514541 samples, the function returned an array with shape (514541, ), and each entry of the array ranged from 0 to 9.
If something like that happens, it must be due to a later part in your code that you do not show here; in any case, the idea would be to find the argument with the highest value from each 10-dimensional network output (since they are interpreted as probabilities, it is intuitive that the element with the highest value would be the most probable). In other words, somewhere in your code there must be something like this:
pred = model.predict(x_test)
y = np.argmax(pred, axis=1) # numpy must have been imported as np
which will give an array of shape (n_samples,), with each y an integer between 0 and 9, as you report.
i.e. if index 5 of my one-hot encoding of y_train represents "orange", does a prediction value of 5 mean that the neural network predicted "orange"?
Provided that the above hold, yes.
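A small sketch of that mapping, using a hand-made array in place of model.predict(x_test). The class names are hypothetical; the only thing taken from the question is that index 5 stands for "orange".

```python
import numpy as np

# Hypothetical class list; position i is the class encoded at one-hot index i.
class_names = ["apple", "banana", "cherry", "grape", "lemon",
               "orange", "peach", "pear", "plum", "kiwi"]

# Stand-in for model.predict(x_test): one row of softmax probabilities per sample.
pred = np.full((2, 10), 0.05)
pred[0, 5] = 0.55  # first sample: class 5 is the most probable
pred[1, 2] = 0.55  # second sample: class 2 is the most probable

y = np.argmax(pred, axis=1)               # shape (n_samples,), integers in 0..9
labels = [class_names[i] for i in y]      # map indices back to names
print(y, labels)
```

The mapping is only valid if the column order used when one-hot encoding y_train is the same order used to build the list of names.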