Output Format using Sparse Categorical Cross Entropy in Keras for Multi-Class Classification - tensorflow

I've built a u-net architecture using the Keras Functional API but I'm having trouble using the sparse categorical cross entropy loss function. My learning task is multi-class, pixel-wise classification for many 256x256 images. The intended output is a 256x256 mask image with integer values from 0-31 (not every mask will contain each class). I have 32 classes, so one-hot encoding gives me an OOM error, which is why I don't use categorical cross entropy. The majority of the mask pixels are 0s (which may be part of the problem).
I keep getting loss = nan. I've normalized my input data to have mean = 0, std = 1. If I leave the masks as they are, I get an accuracy around 0.97 and the output masks are all 1s (which is obviously incorrect). If I add 1 to all my masks before performing training, the accuracy is 0. I'm using relu activations with a SoftMax in the last convolutional layer.
It seems the problem likely has to do with the format of my output data, so my main question is, what format should it be in for sparse categorical cross entropy? Should I normalize the mask values to be 0-1? Alternatively, are there any other loss functions or accuracy metrics I can use for training? As far as multi-class classification goes the only function I know of is categorical cross entropy. I can provide additional information about my data, network, etc. if needed.

Related

Which Loss function & Metrics is more suitable for multi-label classification? Binary or Categorical cross-entropy and Why?

According to my knowledge(please correct me if I'm wrong),
Multi-label classification(mutually inclusive) i.e., samples might have more than 1 correct values (for example movie genre, disease detection, etc).
Multi-Class classification(mutually exclusive) i.e., samples will always have 1 correct value (for example Cat or Dog, object detection, etc) this includes Binary Classification.
Assuming output is one-hot encoding.
What are the loss functions and metrics one has to use for these 2 types?
                 loss func.               metrics
1. multi-label:  (binary, categorical)    (binary_accuracy, TopKCategorical accuracy, categorical_accuracy, AUC)
2. multi-class:  (binary)                 (binary_accuracy, f1, recall, precision)
Please tell me from the above table which of them is/are more suitable, which of them is/are wrong & Why?
If you are doing multi-class classification and the labels (y) are one-hot encoded, use categorical crossentropy as the loss function together with the adam optimizer (it is suitable for most cases). Also, for multi-class classification, the number of output nodes should be the same as the number of classes (labels). Say your model is going to classify the input into 4 classes; you can configure the output layer as follows:
model.add(Dense(4, activation="softmax"))  # Dense comes from tensorflow.keras.layers
Also, forgot to mention that softmax activation should be used in the output layer for multiclass classification problems.
In case your y is not one-hot encoded, I would advise you to choose sparse categorical crossentropy as the loss function. No other changes will be necessary.
Also, I usually split the data into test data and train data and feed them to the model like this to get the accuracy in each epoch..
history = model.fit(train_data, validation_data = test_data, epochs = 10)
Hope it solved your problem.

What is the equivalence of Masking() Keras function in tensorflow? And does batch norm, conv, and relu support Masking?

I am training a GRU layer where the inputs don't all have the same length. Therefore, I have padded the inputs' features with 0.0 to make all sequences the same length. On the other hand, I don't want to compute any loss at any time step, for any sample, as long as the input feature vector is all zeros. For example, at time step 1000 I have a batch size of 34, but samples number 33 and 34 of this batch lack data or feature values at time step 1000.
I have found that we can use the method Masking()(inputs) in Keras as long as all subsequent layers or operations support masking. But I have implemented my model in tensorflow. So what is the equivalence of Masking() in tensorflow?
Second, how can I know whether: batch normalization, conv layer and any non linear activation function has support for the masking() function in Keras?
Your help is much appreciated!!
So I found the detailed solution in danijar blog https://danijar.com/variable-sequence-lengths-in-tensorflow/.
Masking in Keras is used when you have incomplete sequences. Usually you pad your sequences with 0.0 along the third dimension (the feature dimension, when the input has shape = [batch_size, sequence_length, num_features]). The Keras Masking layer then takes a mask value, and every time step whose features all equal that value is treated as masked, so it is skipped by the following layers that support masking.
In summary: he showed how to compute the sequence length for each sample in the batch using a length() function he implemented. The resulting vector is then fed into dynamic_rnn, which will output zero vectors for incomplete sequences (for states and outputs), which is somewhat similar to what happens with the Keras Masking() function. Second, we should use a mask when computing the loss function.
All the details are discussed in this blog post.
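To make the idea of computing sequence lengths concrete, here is a minimal pure-Python sketch. It is not the blog's TensorFlow code; the function name `sequence_lengths` and the sample batch are assumptions made up for illustration. It counts a time step as real when at least one of its features is non-zero, so it also assumes that genuine data never contains an all-zero feature vector.

```python
def sequence_lengths(batch):
    """Return the number of non-padded time steps for each sequence.

    batch is a nested list shaped [batch][time][features]; a step counts
    as padding when every one of its features is exactly 0.0.
    """
    lengths = []
    for sequence in batch:
        # 1 for a step with any non-zero feature, 0 for an all-zero (padded) step
        used = [1 if any(f != 0.0 for f in step) else 0 for step in sequence]
        lengths.append(sum(used))
    return lengths


# Hypothetical batch: 2 sequences padded to 3 time steps, 2 features each.
batch = [
    [[1.0, 2.0], [0.5, 0.0], [0.0, 0.0]],   # 2 real steps
    [[3.0, 1.0], [0.0, 0.0], [0.0, 0.0]],   # 1 real step
]
lengths = sequence_lengths(batch)
```

The same per-step "used" flags can double as the mask when computing the loss.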
As for masking support in batch_norm, conv and non-linear activation functions: usually, if the output of the LSTM is zeros and there is a sigmoid activation function at the output, the derivative of the output with respect to the input of the sigmoid function is output * (1 - output). Hence, when the output is 0, this derivative is zero as well. And since back propagation applies the chain rule, the gradients of the current sample with respect to any weight parameter in the network are going to be 0 as well. Hence, there is no need to worry about masking support... But the problem arises when the activation is relu, for example; this is when the gradients should be explicitly multiplied by zeros before doing the back propagation (I guess). Maybe doing something like this will help:
final_output = output * mask
Then the derivative of final_output with respect to output will be the mask => 0 or 1 (at any time step, for any sample). Then back propagate this gradient from the output of the activation function to its inputs, followed by the chain rule => the weights won't be affected in this case.
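The same mask can be applied to the loss itself. Below is a small pure-Python sketch of that idea, assuming per-time-step losses have already been computed; the name `masked_mean_loss` and the numbers are hypothetical. Padded steps are multiplied by 0, and the average is taken over real steps only, so padding neither contributes to the loss nor dilutes it.

```python
def masked_mean_loss(per_step_loss, mask):
    """Average per-time-step losses over real (unpadded) steps only.

    per_step_loss and mask are nested lists shaped [batch][time];
    mask holds 1.0 for real steps and 0.0 for padded ones.
    """
    total = 0.0
    count = 0.0
    for loss_row, mask_row in zip(per_step_loss, mask):
        for loss, m in zip(loss_row, mask_row):
            total += loss * m   # padded steps contribute nothing
            count += m
    return total / count if count > 0 else 0.0


# Hypothetical batch: 2 sequences padded to 3 steps; the second has 1 real step.
losses = [[0.5, 1.5, 1.0], [2.0, 9.0, 9.0]]
mask = [[1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
masked = masked_mean_loss(losses, mask)   # (0.5 + 1.5 + 1.0 + 2.0) / 4
```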

What are the differences between all these cross-entropy losses in Keras and TensorFlow?

What are the differences between all these cross-entropy losses?
Keras is talking about
Binary cross-entropy
Categorical cross-entropy
Sparse categorical cross-entropy
While TensorFlow has
Softmax cross-entropy with logits
Sparse softmax cross-entropy with logits
Sigmoid cross-entropy with logits
What are the differences and relationships between them? What are the typical applications for them? What's the mathematical background? Are there other cross-entropy types that one should know? Are there any cross-entropy types without logits?
There is just one cross (Shannon) entropy defined as:
H(P||Q) = - SUM_i P(X=i) log Q(X=i)
In machine learning usage, P is the actual (ground truth) distribution, and Q is the predicted distribution. All the functions you listed are just helper functions which accept different ways to represent P and Q.
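As a worked illustration of the definition above, here is a minimal pure-Python sketch. The function name `cross_entropy` and the example distributions are made up for illustration; terms with P(X=i) = 0 are skipped because they contribute nothing to the sum.

```python
import math


def cross_entropy(p, q):
    """H(P||Q) = - SUM_i P(X=i) log Q(X=i), skipping terms where P(X=i) == 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


# Hard target: the true class is index 1, so only Q(X=1) matters.
hard = cross_entropy([0.0, 1.0, 0.0], [0.1, 0.7, 0.2])   # equals -log(0.7)

# Soft target: both terms contribute.
soft = cross_entropy([0.6, 0.4], [0.5, 0.5])             # equals log(2)
```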
There are basically 3 main things to consider:
there are either 2 possible outcomes (binary classification) or more. If there are just two outcomes, then Q(X=1) = 1 - Q(X=0), so a single float in (0,1) identifies the whole distribution; this is why a neural network for binary classification has a single output (and so does logistic regression). If there are K>2 possible outcomes one has to define K outputs (one per each Q(X=...))
one either produces proper probabilities (meaning that Q(X=i)>=0 and SUM_i Q(X=i) = 1) or one just produces a "score" and has some fixed method of transforming the score to a probability. For example a single real number can be "transformed to probability" by taking its sigmoid, and a set of real numbers can be transformed by taking their softmax, and so on.
there is j such that P(X=j)=1 (there is one "true class", targets are "hard", like "this image represents a cat") or there are "soft targets" (like "we are 60% sure this is a cat, but for 40% it is actually a dog").
Depending on these three aspects, a different helper function should be used:
                                outcomes   what is in Q   targets in P
-------------------------------------------------------------------------------
binary CE                       2          probability    any
categorical CE                  >2         probability    soft
sparse categorical CE           >2         probability    hard
sigmoid CE with logits          2          score          any
softmax CE with logits          >2         score          soft
sparse softmax CE with logits   >2         score          hard
In the end one could just use "categorical cross entropy", as this is how it is mathematically defined, however since things like hard targets or binary classification are very popular - modern ML libraries do provide these additional helper functions to make things simpler. In particular "stacking" sigmoid and cross entropy might be numerically unstable, but if one knows these two operations are applied together - there is a numerically stable version of them combined (which is implemented in TF).
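To illustrate the stability point, here is a pure-Python sketch comparing a naive "sigmoid, then cross entropy" computation with the combined formula max(x, 0) - x * z + log(1 + exp(-|x|)), which is the form given in the TensorFlow documentation for sigmoid_cross_entropy_with_logits. The function names are hypothetical, and the snippet is only a scalar illustration, not the library implementation.

```python
import math


def sigmoid_ce_naive(logit, target):
    """Apply sigmoid first, then the cross-entropy formula."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def sigmoid_ce_stable(logit, target):
    """Combined form: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# For moderate logits the two versions agree.
moderate_naive = sigmoid_ce_naive(2.0, 1.0)
moderate_stable = sigmoid_ce_stable(2.0, 1.0)

# For a confident wrong prediction the sigmoid rounds to exactly 1.0 in
# floating point, so the naive version ends up taking log(0) and fails.
try:
    large_naive = sigmoid_ce_naive(40.0, 0.0)
except ValueError:
    large_naive = None
large_stable = sigmoid_ce_stable(40.0, 0.0)   # approximately 40.0
```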
It is important to notice that if you apply the wrong helper function the code will usually still execute, but the results will be wrong. For example, if you apply a softmax_* helper for binary classification with one output, your network will be considered to always produce "True" at the output.
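The single-output case is easy to check by hand: a softmax over one score divides exp(score) by itself. A minimal pure-Python sketch, with a hypothetical `softmax` helper:

```python
import math


def softmax(scores):
    """Softmax over a list of scores, shifted by the max for stability."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# With a single output unit, the "probability" is 1.0 whatever the score is.
single_output_probs = [softmax([logit])[0] for logit in (-5.0, 0.0, 3.0)]
```

Every entry of `single_output_probs` is 1.0, so the predicted class never depends on the network's output.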
As a final note - this answer considers classification, it is slightly different when you consider multi label case (when a single point can have multiple labels), as then Ps do not sum to 1, and one should use sigmoid_cross_entropy_with_logits despite having multiple output units.
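A small sketch of the multi-label case, assuming three independent labels and made-up logits: each output unit gets its own sigmoid and its own binary cross entropy, and the per-label losses are averaged. The function names are hypothetical, and the naive formula is used only because the logits here are moderate.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def multilabel_loss(logits, targets):
    """Mean of independent binary cross-entropies, one per label."""
    losses = []
    for x, z in zip(logits, targets):
        p = sigmoid(x)
        losses.append(-(z * math.log(p) + (1.0 - z) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


logits = [2.0, -1.0, 0.5]
targets = [1.0, 0.0, 1.0]          # two labels are true at once
probs = [sigmoid(x) for x in logits]
loss = multilabel_loss(logits, targets)
```

Note that `sum(probs)` is larger than 1 here, which is fine: the outputs are separate per-label probabilities, not one distribution over classes.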
Logits
For this purpose, "logits" can be seen as the non-activated outputs of the model.
By default, Keras losses take an "activated" output (you must apply "sigmoid" or "softmax" before the loss)
The TensorFlow functions listed above take "logits", i.e. "non-activated" outputs (you should not apply "sigmoid" or "softmax" before the loss)
Losses "with logits" will apply the activation internally.
Some functions allow you to choose from_logits=True or from_logits=False, which tells the function whether to apply the activation itself (True) or to assume it was already applied (False).
Sparse
Sparse functions use the target data (ground truth) as "integer labels": 0, 1, 2, 3, 4.....
Non-sparse functions use the target data as "one-hot labels": [1,0,0], [0,1,0], [0,0,1]
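The two label formats lead to the same loss value. Here is a pure-Python sketch with hypothetical helper names: the sparse version uses the integer label as an index into the predicted probabilities, which is why integer labels must stay as class indices (0, 1, 2, ...) rather than being rescaled.

```python
import math


def to_one_hot(label, num_classes):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def categorical_ce(one_hot, probs):
    """Cross entropy with a one-hot target vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)


def sparse_categorical_ce(label, probs):
    """Cross entropy with an integer class index as the target."""
    return -math.log(probs[label])


probs = [0.2, 0.5, 0.3]   # made-up softmax output for 3 classes
label = 2
dense_loss = categorical_ce(to_one_hot(label, 3), probs)
sparse_loss = sparse_categorical_ce(label, probs)   # same value, no one-hot needed
```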
Binary crossentropy = Sigmoid crossentropy
Problem type:
single class (false/true); or
non-exclusive multiclass (many classes may be correct)
Model output shape: (batch, ..., >=1)
Activation: "sigmoid"
Categorical crossentropy = Softmax crossentropy
Problem type: exclusive classes (only one class may be correct)
Model output shape: (batch, ..., >=2)
Activation: "softmax"

Output of tf.nn.softmax_cross_entropy_with_logits unnormalized?

I implemented a simple cnn network for image classification (binary classification). I am using tensorflow in Python.
I am using tf.nn.softmax_cross_entropy_with_logits as a cost function. I feed the cost function with unnormalized logits from the output layer of my model. Should the function output normalized probabilities, or am I wrong?
During training of my model I am printing the cost of every single example. If the model correctly predicts the output, the cost equals 0.0; otherwise the cost is a very big, unnormalized value. Since the function 'softmaxes' its input before calculating cross entropy, why is the output unnormalized?
You are mistaking cross-entropy (your loss function) with softmax (the "virtual" output of your net -- see below). Softmax is normalized, but cross-entropy is not -- it can take arbitrarily high values to penalize bad predictions.
When you use a non-normalized net output in combination with tf.nn.softmax_cross_entropy_with_logits, you actually don't observe the softmax output: it is processed within the cost function and remains virtual. To peek at the softmax you can compute it explicitly using tf.nn.softmax on the non-normalized output of your net.
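The difference can be seen with plain numbers. The following pure-Python sketch (hypothetical helper names, made-up logits for a two-class problem) computes the softmax and the cross entropy separately: the probabilities always sum to 1, while the loss is close to 0 for a confident correct prediction and large for a confident wrong one.

```python
import math


def softmax(logits):
    shift = max(logits)                       # shift for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_ce(logits, label):
    """Cross entropy of the softmax of the logits against an integer label."""
    return -math.log(softmax(logits)[label])


logits = [10.0, -10.0]                # the net is very confident in class 0
probs = softmax(logits)               # normalized: sums to 1
loss_correct = softmax_ce(logits, 0)  # tiny, prints as roughly 0.0
loss_wrong = softmax_ce(logits, 1)    # about 20, and grows with the logit gap
```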

How tf.nn.softmax_cross_entropy_with_logits can compute softmax cross entropy in tensorflow?

tf.nn.softmax_cross_entropy_with_logits: the documentation says that it computes softmax cross entropy between logits and labels. What does that mean? Is it not applying the cross entropy loss function formula to them? Why does the documentation say that it computes softmax cross entropy?
Also from the Docs:
Measures the probability error in discrete classification tasks in which the classes are mutually exclusive (each entry is in exactly one class).
Softmax classification uses cross-entropy loss function to train and classify data among discrete classes. There are other activation functions used like ReLU (Rectified Linear Units) or Sigmoid that are used in Linear Classification and NN; in this case Softmax is used.
Activation functions are decision functions (the ones that actually classify data into categories) and cross-entropy is the function used to calculate the error during training (you could use other ways to calculate the error cost, like mean squares). However, cross-entropy currently seems to be the best way to calculate it.
As some point out, softmax cross-entropy is a commonly used term in Classification for convenient notation.
Edit
Regarding the logits, it means that it works with its input data unscaled. In other words, the input data may not be a probability value (i.e., values may be > 1). Check this question to know more about softmax_cross_entropy_with_logits and its components.