tensorflow - softmax ignore negative labels (just like caffe) [duplicate] - tensorflow

This question already has answers here:
TensorFlow: How to handle void labeled data in image segmentation?
(2 answers)
Closed 5 years ago.
In Caffe, there is an option with its SoftmaxWithLoss function to ignore all negative labels (-1) in computing probabilities, so that only 0 or positive label probabilities add up to 1.
Is there a similar feature with Tensorflow softmax loss?

Just came up with a work-around --- I created a one-hot tensor on the label indices using tf.one_hot (with the depth set at the # of labels). tf.one_hot automatically zeros out all indices with -1 in the resulting one_hot tensor (of shape [batch, # of labels])
This enables softmax loss (i.e. tf.nn.softmax_cross_entropy_with_logits) to "ignore" all -1 labels.

I am not quite sure that your workaround is actually working.
Caffe's ignore_label in caffe semantically has to be considered as "label of a sample which has to be ignored", thus it has as an effect that the gradient for that sampl_e is not backpropagated, which is in no way guranteed by the use of a one hot vector.
On one hand, I expect any meaningful model to quickly learn to predict a zero value, or small enough value, for that specific entry, cause of the fact all samples will have a zero in that specific entry, so to say, backpropagated info due to errors in that prediction will vanish relativly fast.
On the other hand you need to be aware that, from a math point of view caffe's ignore_label and what you are doing are totally different.
Said this, I am new to TF and need the exact same feature as caffe's ignore_label.

Related

Binary classification of pairs with opposite labels

I have a data-set without labels, but I do have a way to get pairs of examples with opposite labels, that is given a pair x,z I know that their true labels are either 0,1 or 1,0.
So, I am building a model that accepts pairs of samples as input, and learns to classify them with opposite labels. Assuming I have an arbitrary model for predicting a single sample, y_hat = f(x), I am building a model with Keras that accepts pairs of samples (x,z) and outputs pairs of predictions, f(x), f(z). I then use a custom loss function that drives the model towards the correct direction: Given that a regular binary classifier is trained using the Binary Cross Entropy (BCE) to make the predicted and desired output "close", I use the negative BCE. Also, since BCE is not symmetric, I symmetrize it. So, the loss function I give the model.compile method is:
from tensorflow import keras
bce = keras.losses.BinaryCrossentropy()
def neg_sym_bce(y1, y2):
return (- 0.5 * (bce(y1, y2) + bce(y2, y1)))
My problem is, this model fails to learn to classify even a single pair of my data (I get f(x)~=f(z)~=0.5), and if I try to train it with synthetic "easy" data, it takes hundreds of epochs to converge (also on a single pair).
This made me suspect that it has to do with a "vanishing gradient" problem. Indeed, when I plot (see below) the loss for a single pair, which is a function of 2 variables (the 2 outputs), it is evident that there is a wide plateau around the 0.5, 0.5 point. It is also evident that the global minima is, as expected, around the points 0,1 and 1,0.
So, is there a way to deal with the vanishing gradient here? I read about the problem but the references I found deal with vanishing gradient in the network, not in the loss itself.
Or, is there another loss that can drive the model to predict opposite labels?
Think if your labels are always either 0,1 or 1,1 just use categorical_crossentropy for the loss.

Dimensions of a convolution?

I have some questions regarding how this convolution is calculated and its output dimension. I'm familiar with simple convolutions with a nxm kernel, using strides, dilations or padding, thats not a problem, but this dimensions seems odd to me. Since the model that I'm using is pretty well known onnx-mnist, I assume it is correct.
So, my point is:
If the input has a dimensions of 1x1x28x28, how is the output 1x8x28x28?
W denotes the kernel. How can it be 8x1x5x5? As far as I know, the first dimension is the batch size, but here I'm just doing inference with 1 input. Does this make sense?
I'm implementing from scratch this convolution operator, and so far it works for 1x1x28x28 and a kernel of 1x1x5x5, but that extra dimensions doesn't make sense to me.
Find attached the convolution that I'm trying to do, hope is not too onnx specific.
I do not see the code you are using but I guess 8 is the number of kernels. This means you apply 8 different kernels on your input with the size 5x5 over a batch size of 1. That is how you get 1x8x28x28 in the output, the 8 denotes the number of activation maps (one for each kernel).
The numbers of your kernel dimensions (8x1x5x5) explained:
8: Number of different filters/kernels (will be number of output maps per image)
1: Number of input channels. If your input image was RGB instead of grayscale, this would be 3 instead of 1.
5: First spatial dimension
5: Second spatial dimension

Adding a second loss mask in tensorflow

I'm using google's seq2seq library and based on some computations that I do between the predictions and the input I would like to zero out some of the losses for certain time steps (not for padding).
What I do is basically to go through each batch then through each time step of the decoder (logits) and for eatc time step I add "a zero or a one" to a list (based on my computation). This list should then be converted to a tensor and multiplied by the losses.
My problem is the shape of the tensor reurned by the sparse_softmax_cross_entropy_with_logits is variable, its not always the shape of the target tensor. So there is a mismatch of dimensions. Has anyone does something like this before and can share it, or know why this happens.

What are the effects of padding a tensor?

I'm working on a problem using Keras that has been presenting me with issues:
My X data is all of shape (num_samples, 8192, 8), but my Y data is of shape (num_samples, 4), where 4 is a one-hot encoded vector.
Both X and Y data will be run through LSTM layers, but the layers are rejecting the Y data because it doesn't match the shape of the X data.
Is padding the Y data with 0s so that it matches the dimensions of the X data unreasonable? What kind of effects would that have? Is there a better solution?
Edited for clarification:
As requested, here is more information:
My Y data represents the expected output of passing the X data through my model. This is my first time working with LSTMs, so I don't have an architecture in mind, but I'd like to use an architecture that works well with classifying long (8192-length) sequences of words into one of several categories. Additionally, the dataset that I have is of an immense size when fed through an LSTM, so I'm currently using batch-training.
Technologies being used:
Keras (Tensorflow Backend)
TL;DR Is padding one tensor with zeroes in all dimensions to match another tensor's shape a bad idea? What could be a better approach?
First of all, let's make sure your representation is actually what you think it is; the input to an LSTM (or any recurrent layer, for that matter) must be of dimensionality: (timesteps, shape), i.e. if you have 1000 training samples, each consisting of 100 timesteps, with each timestep having 10 values, your input shape will be (100,10,). Therefore I assume from your question that each input sample in your X set has 8192 steps and 8 values per step. Great; a single LSTM layer can iterate over these and produce 4-dimensional representations with absolutely no problem, just like so:
myLongInput = Input(shape=(8192,8,))
myRecurrentFunction = LSTM(4)
myShortOutput = myRecurrentFunction(myLongInput)
myShortOutput.shape
TensorShape([Dimension(None), Dimension(4)])
I assume your problem stems from trying to apply yet another LSTM on top of the first one; the next LSTM expects a tensor that has a time dimension, but your output has none. If that is the case, you'll need to let your first LSTM also output the intermediate representations at each time step, like so:
myNewRecurrentFunction=LSTM(4, return_sequences=True)
myLongOutput = myNewRecurrentFunction(myLongInput)
myLongOutput.shape
TensorShape([Dimension(None), Dimension(None), Dimension(4)])
As you can see the new output is now a 3rd order tensor, with the second dimension now being the (yet unassigned) timesteps. You can repeat this process until your final output, where you usually don't need the intermediate representations but rather only the last one. (Sidenote: make sure to set the activation of your last layer to a softmax if your output is in one-hot format)
On to your original question, zero-padding has very little negative impact on your network. The network will strain itself a bit in the beginning trying to figure out the concept of the additional values you have just thrown at it, but will very soon be able to learn they're meaningless. This comes at a cost of a larger parameter space (therefore more time and memory complexity), but doesn't really affect predictive power most of the time.
I hope that was helpful.

Multilabel image classification with sparse labels in TensorFlow?

I want to perform a multilabel image classification task for n classes.
I've got sparse label vectors for each image and each dimension of each label vector is currently encoded in this way:
1.0 ->Label true / Image belongs to this class
-1.0 ->Label false / Image does not contain to this class.
0.0 ->missing value/label
E.g.: V= {1.0,-1.0,1.0, 0.0}
For this example V the model should learn, that the corresponding image should be classified in the first and third class.
My problem is currently how to handle the missing values/labels. I've searched through the issues and found this issue:
tensorflow/skflow#113 found here
So could do multilable image classification with:
tf.nn.sigmoid_cross_entropy_with_logits(logits, targets, name=None)
but TensorFlow has this error function for sparse softmax, which is used for exclusive classification:
tf.nn.sparse_softmax_cross_entropy_with_logits(logits, labels, name=None)
So is there something like sparse sigmoid cross entropy? (Couldn't find something) or any suggestions how can I handle my multilabel classification problem with sparse labels.
I used weighted_cross_entropy_with_logits as the loss function with positive weights for 1s.
In my case, all the labels are equally important. But 0 was ten times more likely to be appeared as the value of any label than 1.
So I weighed all the 1s by calling the pos_weight parameter of the aforementioned loss function. I used a pos_weight (= weight on positive values) of 10. By the way, I do not recommend any strategy to calculate the pos_weight. I think it will depend explicitly on the data in hand.
if real label = 1,
weighted_cross_entropy = pos_weight * sigmoid_cross_entropy
Weighted cross entropy with logits is same as the Sigmoid cross entropy with logits, except for the extra weight value multiplied to all the targets with a positive real value i.e.; 1.
Theoretically, it should do the job. I am still tuning other parameters to optimize the performance. Will update with performance statistics later.
First I would like to know what you mean by missing data? What is the difference between miss and false in your case?
Next, I think it is wrong that you represent your data like this. You have unrelated information that you try to represent on the same dimension. (If it was false or true it would work)
What seems to me better is to represent for each of your class a probability if it is good, or is missing or is false.
In your case V = [(1,0,0),(0,0,1),(1,0,0),(0,1,0)]
Ok!
So your problem is more about how to handle the missing data I think.
So I think you should definitely use tf.sigmoid_cross_entropy_with_logits()
Just change the target for the missing data to 0.5. (0 for false and 1 for true).
I never tried this approach but it should let your network learn without biasing it too much.