Multilabel image classification with sparse labels in TensorFlow? - tensorflow

I want to perform a multilabel image classification task for n classes.
I've got sparse label vectors for each image and each dimension of each label vector is currently encoded in this way:
1.0 -> Label true / Image belongs to this class
-1.0 -> Label false / Image does not belong to this class
0.0 -> Missing value / label unknown
E.g.: V = {1.0, -1.0, 1.0, 0.0}
For this example V, the model should learn that the corresponding image belongs to the first and third classes.
My problem currently is how to handle the missing values/labels. I've searched through the issues and found this one:
tensorflow/skflow#113
So I could do multilabel image classification with:
tf.nn.sigmoid_cross_entropy_with_logits(logits, targets, name=None)
but TensorFlow has this error function for sparse softmax, which is used for exclusive classification:
tf.nn.sparse_softmax_cross_entropy_with_logits(logits, labels, name=None)
So is there something like a sparse sigmoid cross entropy? (I couldn't find one.) Or are there any suggestions for how to handle a multilabel classification problem with sparse labels?

I used weighted_cross_entropy_with_logits as the loss function with positive weights for 1s.
In my case, all the labels are equally important, but 0 was ten times more likely to appear as the value of any label than 1.
So I weighted all the 1s by setting the pos_weight parameter (the weight on positive values) of the aforementioned loss function to 10. By the way, I cannot recommend a general strategy for calculating pos_weight; I think it depends entirely on the data at hand.
If the real label is 1:
weighted_cross_entropy = pos_weight * sigmoid_cross_entropy
Weighted cross entropy with logits is the same as sigmoid cross entropy with logits, except that the loss terms for targets with a positive real value (i.e. 1) are multiplied by the extra pos_weight factor.
Theoretically, it should do the job. I am still tuning other parameters to optimize the performance. Will update with performance statistics later.
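As a minimal TF 1.x-style sketch of this setup (the shapes and placeholders are assumptions; only the pos_weight of 10 comes from the description above, and newer TensorFlow versions take labels= instead of targets=):

import tensorflow as tf

n_classes = 4  # assumed number of classes, for illustration
labels = tf.placeholder(tf.float32, [None, n_classes])  # 1.0 = present, 0.0 = absent
logits = tf.placeholder(tf.float32, [None, n_classes])  # raw model outputs, no sigmoid

# pos_weight=10 reflects the ~10:1 label imbalance described above
loss = tf.nn.weighted_cross_entropy_with_logits(
    targets=labels, logits=logits, pos_weight=10.0)
loss = tf.reduce_mean(loss)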

First I would like to know what you mean by missing data. What is the difference between missing and false in your case?
Next, I think it is a mistake to represent your data like this: you are trying to represent unrelated information on the same dimension. (If it were only false or true, it would work.)
What seems better to me is to represent, for each of your classes, whether it is true, missing, or false.
In your case V = [(1,0,0),(0,0,1),(1,0,0),(0,1,0)]
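A quick sketch of that re-encoding (the {1.0, -1.0, 0.0} input convention is the one from the question):

import numpy as np

# map each sparse label to a one-hot triple: (true, missing, false)
encoding = {1.0: (1, 0, 0), 0.0: (0, 1, 0), -1.0: (0, 0, 1)}
V = [1.0, -1.0, 1.0, 0.0]
V_encoded = np.array([encoding[v] for v in V])
# -> [[1 0 0], [0 0 1], [1 0 0], [0 1 0]]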

Ok!
So your problem is more about how to handle the missing data, I think.
I think you should definitely use tf.nn.sigmoid_cross_entropy_with_logits().
Just change the target for the missing data to 0.5 (0 for false and 1 for true).
I have never tried this approach, but it should let your network learn without biasing it too much.
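A minimal TF 1.x-style sketch of that idea (the shapes and placeholder names are assumptions):

import tensorflow as tf

n_classes = 4  # assumed for illustration
raw_labels = tf.placeholder(tf.float32, [None, n_classes])  # values in {1.0, -1.0, 0.0}
logits = tf.placeholder(tf.float32, [None, n_classes])      # raw model outputs

# map 1.0 -> 1.0 (true), -1.0 -> 0.0 (false), 0.0 -> 0.5 (missing)
targets = (raw_labels + 1.0) / 2.0
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=targets, logits=logits))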

Related

Binary classification of pairs with opposite labels

I have a data-set without labels, but I do have a way to get pairs of examples with opposite labels, that is given a pair x,z I know that their true labels are either 0,1 or 1,0.
So, I am building a model that accepts pairs of samples as input and learns to classify them with opposite labels. Assuming I have an arbitrary model for predicting a single sample, y_hat = f(x), I am building a model with Keras that accepts pairs of samples (x, z) and outputs pairs of predictions, f(x), f(z).
I then use a custom loss function that drives the model in the correct direction: given that a regular binary classifier is trained using the binary cross entropy (BCE) to make the predicted and desired output "close", I use the negative BCE. Also, since BCE is not symmetric, I symmetrize it. So, the loss function I give the model.compile method is:
from tensorflow import keras
bce = keras.losses.BinaryCrossentropy()
def neg_sym_bce(y1, y2):
    return -0.5 * (bce(y1, y2) + bce(y2, y1))
My problem is, this model fails to learn to classify even a single pair of my data (I get f(x)~=f(z)~=0.5), and if I try to train it with synthetic "easy" data, it takes hundreds of epochs to converge (also on a single pair).
This made me suspect that it has to do with a "vanishing gradient" problem. Indeed, when I plot (see below) the loss for a single pair, which is a function of 2 variables (the 2 outputs), it is evident that there is a wide plateau around the (0.5, 0.5) point. It is also evident that the global minima are, as expected, around the points (0, 1) and (1, 0).
So, is there a way to deal with the vanishing gradient here? I read about the problem but the references I found deal with vanishing gradient in the network, not in the loss itself.
Or, is there another loss that can drive the model to predict opposite labels?
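For reference, here is a minimal end-to-end sketch of the setup described above (the encoder architecture, input size, slicing convention in the loss, and dummy targets are my assumptions, not the author's code):

import numpy as np
from tensorflow import keras

n_features = 10  # assumed input dimensionality
bce = keras.losses.BinaryCrossentropy()

def neg_sym_bce(y_true, y_pred):
    # y_pred holds the pair (f(x), f(z)); y_true is a dummy and is ignored
    y1, y2 = y_pred[:, 0], y_pred[:, 1]
    return -0.5 * (bce(y1, y2) + bce(y2, y1))

# shared single-sample model f
f = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(n_features,)),
    keras.layers.Dense(1, activation="sigmoid"),
])

x_in = keras.Input(shape=(n_features,))
z_in = keras.Input(shape=(n_features,))
pair_out = keras.layers.Concatenate()([f(x_in), f(z_in)])

model = keras.Model([x_in, z_in], pair_out)
model.compile(optimizer="adam", loss=neg_sym_bce)
# dummy targets, since the loss only compares the two predictions:
# model.fit([x_data, z_data], np.zeros(len(x_data)), epochs=10)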
I think if your labels are always either (0,1) or (1,0), you can just use categorical_crossentropy for the loss.

Tensorflow bounded regression vs classification

As part of my masters thesis I have been tasked with predicting a label integer (0-255) which is a binned representation of an angle. The feature columns are also integers, in the range (0-255).
So far I have used the custom Tensorflow layers estimator, implementing a 256 output classifier which performs well. However, my issue with the classification approach I am using is the following:
My classification model thinks that predicting a 3 instead of a 28 is as good/bad as predicting a 27 instead of a 28.
The numerical interval / ordinal nature of my data (not sure which) leads me to believe that if I used regression I would achieve results with less drastically incorrect predictions or outliers.
My goal:
to reduce the number of drastically incorrect predicted outliers
My questions:
Is regression the better approach, or can I improve my classification to include an ordinal/interval relationship between my labels?
If I choose regression, is there a way to bound my predicted output between 0-255? (I know I will have to round the predicted float values.)
Thanks in advance. Any other comments, suggestions or ideas to help me to best tackle the problem are also very helpful.
If I made any incorrect assumptions or mistake in my interpretation of the problem feel free to correct me.
Question 1: Regression is the simpler approach, however, you can also use classification and manipulate the loss function to have a lower loss for misclassifications that are "close" to the original class.
Question 2: The TensorFlow op for bounding your prediction is tf.clip_by_value. Are you mapping all 360 degrees to [0, 255]? In that case you will want to consider the boundary cases, i.e. your estimator yields -4 and the true value is 251, but they are actually representing the same value, so the loss should be 0.
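A TF 1.x-style sketch of both points (the placeholder names and the squared circular error are my choices; the answer only names tf.clip_by_value):

import tensorflow as tf

labels = tf.placeholder(tf.float32, [None])       # true bins in [0, 255]
predictions = tf.placeholder(tf.float32, [None])  # raw regression outputs

# bound the predictions to the valid range
bounded = tf.clip_by_value(predictions, 0.0, 255.0)

# wrap-aware error: the distance between bins 251 and 3 is 8, not 248
diff = tf.abs(labels - bounded)
circular_diff = tf.minimum(diff, 256.0 - diff)
loss = tf.reduce_mean(tf.square(circular_diff))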

Use tf.layers.batch_normalization to preprocess inputs for SELU activation function?

The SELU activation function (https://github.com/bioinf-jku/SNNs/blob/master/selu.py) requires the input to be normalized to have the mean value of 0.0 and the variance of 1.0. Therefore, I tried to apply tf.layers.batch_normalization (axis=-1) on the raw data to meet that requirement. The raw data in each batch have the shape of [batch_size, 15], where 15 refers to the number of features. The graph below shows the variances of 5 of these features returned from tf.layers.batch_normalization (~20 epochs). They are not all close to 1.0 as expected. The mean values are not all close to 0.0 as well (graphs not shown).
How should I get the 15 features all normalized independently (I expect every feature after normalization will have mean = 0 and var = 1.0)?
After reading the original papers of batch normalization (https://arxiv.org/abs/1502.03167) and SELU (https://arxiv.org/abs/1706.02515), I have a better understanding of them:
Batch normalization is an "isolation" procedure that ensures the input to the next layer (in any mini-batch) has a fixed distribution, so the so-called "internal covariate shift" problem is fixed. The affine transform (γ·x̂ + β) just tunes the standardized x̂ to another fixed distribution for better expressiveness. For plain normalization, we need to set the center and scale parameters to False when calling tf.layers.batch_normalization.
Make sure the epsilon (also in tf.layers.batch_normalization) is set at least two orders of magnitude below the smallest magnitude in the input data. The default value of epsilon is 0.001. In my case, some features have values as low as 1e-6, so I had to change epsilon to 1e-8.
The inputs to SELU have to be normalized before feeding them into the model. tf.layers.batch_normalization is not designed for that purpose.
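A minimal sketch of per-feature standardization from training-set statistics (the data arrays are placeholders; only the 15-feature shape comes from the question):

import numpy as np

# placeholder data: samples with the 15 features from the question
train_data = np.random.rand(1000, 15)
test_data = np.random.rand(200, 15)

mean = train_data.mean(axis=0)
std = train_data.std(axis=0) + 1e-8   # small epsilon avoids division by zero

train_norm = (train_data - mean) / std  # each feature now has mean ~0, var ~1
test_norm = (test_data - mean) / std    # reuse the training statistics at test time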

what is the difference between sampled_softmax_loss and nce_loss in tensorflow?

I notice there are two functions for negative sampling in TensorFlow to compute the loss (sampled_softmax_loss and nce_loss). The parameters of these two functions are similar, but I really want to know what the difference between the two is.
Sampled softmax is all about selecting a sample of the given size and computing the softmax loss over it. The main objective is to make the result of the sampled softmax approximate the true softmax, so the algorithm's quality depends heavily on how those samples are drawn from the given distribution.
NCE loss, on the other hand, is about selecting noise samples and trying to mimic the true softmax: it takes only one true class and K noise classes and trains a binary classifier to tell them apart.
Sampled softmax tries to normalise over all samples in your output. With a skewed label distribution (log-uniform over your labels) this is not an optimal loss function. Note that although they have the same parameters, the way you use each function is different. Take a look at the documentation here: https://github.com/calebchoo/Tensorflow/blob/master/tensorflow/g3doc/api_docs/python/functions_and_classes/shard4/tf.nn.nce_loss.md and read this line:
By default this uses a log-uniform (Zipfian) distribution for sampling, so your labels must be sorted in order of decreasing frequency to achieve good results. For more details, see log_uniform_candidate_sampler.
Take a look at this paper where they explain why they use it for word embeddings: http://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf
Hope this helps!
Check out this documentation from TensorFlow https://www.tensorflow.org/extras/candidate_sampling.pdf
They seem pretty similar, but sampled softmax is only applicable for a single label while NCE extends to the case where your labels are a multiset. NCE can then model the expected counts rather than presence/absence of a label. I'm not clear on an exact example of when to use the sampled_softmax.
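To see how similar the two APIs are, here is a TF 1.x-style sketch calling both with the same arguments (all sizes and variables are assumptions):

import tensorflow as tf

num_classes, embed_dim, num_sampled = 10000, 128, 64  # assumed sizes

weights = tf.Variable(tf.random_normal([num_classes, embed_dim]))
biases = tf.Variable(tf.zeros([num_classes]))
inputs = tf.placeholder(tf.float32, [None, embed_dim])  # hidden-layer activations
labels = tf.placeholder(tf.int64, [None, 1])            # true class ids

# same signature, different loss: softmax approximation vs. binary NCE
softmax_loss = tf.nn.sampled_softmax_loss(
    weights=weights, biases=biases, labels=labels, inputs=inputs,
    num_sampled=num_sampled, num_classes=num_classes)
nce = tf.nn.nce_loss(
    weights=weights, biases=biases, labels=labels, inputs=inputs,
    num_sampled=num_sampled, num_classes=num_classes)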

tensorflow - softmax ignore negative labels (just like caffe) [duplicate]

This question already has answers here:
TensorFlow: How to handle void labeled data in image segmentation?
(2 answers)
Closed 5 years ago.
In Caffe, there is an option with its SoftmaxWithLoss function to ignore all negative labels (-1) in computing probabilities, so that only 0 or positive label probabilities add up to 1.
Is there a similar feature with Tensorflow softmax loss?
Just came up with a work-around --- I created a one-hot tensor on the label indices using tf.one_hot (with the depth set to the # of labels). tf.one_hot automatically produces an all-zero row for every -1 index in the resulting one_hot tensor (of shape [batch, # of labels]).
This enables the softmax loss (i.e. tf.nn.softmax_cross_entropy_with_logits) to "ignore" all -1 labels.
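A small sketch of that behaviour (the sizes are assumptions; an all-zero label row makes the cross-entropy term for that sample exactly zero):

import tensorflow as tf

num_classes = 5  # assumed
label_ids = tf.constant([2, -1, 0])  # -1 marks a sample to ignore
one_hot = tf.one_hot(label_ids, depth=num_classes)
# rows for -1 come out all-zero:
# [[0 0 1 0 0], [0 0 0 0 0], [1 0 0 0 0]]

logits = tf.placeholder(tf.float32, [None, num_classes])
# the all-zero row contributes exactly zero loss for that sample
loss = tf.nn.softmax_cross_entropy_with_logits(labels=one_hot, logits=logits)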
I am not quite sure that your workaround is actually working.
Caffe's ignore_label semantically has to be read as "the label of a sample which is to be ignored"; its effect is that the gradient for that sample is not backpropagated, which is in no way guaranteed by the use of a one-hot vector.
On one hand, I expect any meaningful model to quickly learn to predict a zero, or small enough, value for that specific entry, because all samples will have a zero in that specific entry; so to speak, the backpropagated information due to errors in that prediction will vanish relatively fast.
On the other hand, you need to be aware that, from a mathematical point of view, Caffe's ignore_label and what you are doing are totally different.
That said, I am new to TF and need the exact same feature as Caffe's ignore_label.
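For completeness, one common way to get closer to Caffe's ignore_label semantics in TF is to mask the per-sample losses explicitly (a sketch with assumed names and sizes):

import tensorflow as tf

num_classes = 5  # assumed
label_ids = tf.placeholder(tf.int32, [None])  # -1 marks "ignore"
logits = tf.placeholder(tf.float32, [None, num_classes])

valid = tf.not_equal(label_ids, -1)
# clamp ignored ids to a valid index so the cross entropy stays well defined
safe_ids = tf.where(valid, label_ids, tf.zeros_like(label_ids))
per_sample = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=safe_ids, logits=logits)

mask = tf.cast(valid, tf.float32)
# average only over the non-ignored samples
loss = tf.reduce_sum(per_sample * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)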