What is the difference between sampled_softmax_loss and nce_loss in TensorFlow?

I notice there are two functions for negative sampling in TensorFlow that compute the loss (sampled_softmax_loss and nce_loss). The parameters of these two functions are similar, but what is the actual difference between them?

Sampled softmax selects a sample of classes of the given size and computes the softmax loss over that sample only. The main objective is to make the sampled softmax loss approximate the true (full) softmax loss, so the algorithm depends heavily on how those samples are drawn from the given distribution.
NCE loss, on the other hand, selects noise samples and tries to mimic the true softmax with them. It takes one true class and K noise classes.
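To make the contrast concrete, here is a minimal pure-Python sketch, not the TensorFlow implementation. It assumes the logits for the true class and for the K sampled classes are already computed, and it leaves out the correction by the log of each class's sampling probability that, as far as I understand, both TensorFlow functions apply by default. The function names mirror the TensorFlow ones only for readability.

```python
import math

def softplus(v):
    # log(1 + exp(v)), computed in a numerically stable way
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def sampled_softmax_loss(true_logit, sampled_logits):
    # One softmax cross-entropy over the reduced set {true class} + sampled classes.
    logits = [true_logit] + list(sampled_logits)
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - true_logit

def nce_loss(true_logit, sampled_logits):
    # Independent binary logistic losses: the true class is labelled 1,
    # every noise class is labelled 0.
    return softplus(-true_logit) + sum(softplus(v) for v in sampled_logits)

base_ss = sampled_softmax_loss(2.0, [0.0, -1.0])
shifted_ss = sampled_softmax_loss(7.0, [5.0, 4.0])   # all logits shifted by +5
base_nce = nce_loss(2.0, [0.0, -1.0])
shifted_nce = nce_loss(7.0, [5.0, 4.0])
```

Shifting every logit by the same constant leaves the sampled softmax value unchanged, because the classes compete inside one normalisation, while the NCE value changes, because each class is scored by its own binary classifier.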

Sampled softmax tries to normalise over all samples in your output. If your labels follow a non-normal distribution (for example, logarithmic over your labels), this is not an optimal loss function. Note that although the two functions take the same parameters, the way you use them is different. Take a look at the documentation here: https://github.com/calebchoo/Tensorflow/blob/master/tensorflow/g3doc/api_docs/python/functions_and_classes/shard4/tf.nn.nce_loss.md and read this line:
By default this uses a log-uniform (Zipfian) distribution for sampling, so your labels must be sorted in order of decreasing frequency to achieve good results. For more details, see log_uniform_candidate_sampler.
Take a look at this paper where they explain why they use it for word embeddings: http://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf
Hope this helps!

Check out this documentation from TensorFlow https://www.tensorflow.org/extras/candidate_sampling.pdf
They seem pretty similar, but sampled softmax is only applicable to a single label, while NCE extends to the case where your labels are a multiset. NCE can then model the expected counts rather than the presence/absence of a label. I'm not clear on an exact example of when to use sampled_softmax_loss.

Related

Loss function variational Autoencoder in Tensorflow example

I have a question regarding the loss function in a variational autoencoder. I followed the TensorFlow example https://www.tensorflow.org/tutorials/generative/cvae to create an LSTM-VAE for sampling a sine function.
My encoder input is a set of points (x_i, sin(x_i)) for a specific range (randomly sampled), and I expect similar values as the output of the decoder.
In the tensorflow guide, there is cross-entropy used to compare the encoder input with the decoder output.
cross_ent = tf.nn.sigmoid_cross_entropy_with_logits(logits=x_logit, labels=x)
This makes sense, because the input and output are treated as probabilities. But in reality these probability functions represent the points of my sine function.
Can't I simply use a mean squared error instead of the cross-entropy (I tried it and it works well), or does this cause wrong behaviour of the architecture at some point?
Best regards and thanks for your help!
Well, such questions happen when you work too much and stop thinking properly. To solve this, it helps to think about what I'm actually trying to do.
p(x|z) is the decoder reconstruction, which means that by sampling from z, the value x is generated with probability p. The TensorFlow example deals with image classification/generation, and in that case cross-entropy makes sense. I simply want to minimize the distance between my input and output, so using MSE is the logical choice.
Hope that helps someone at some point.
Regards.
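One way to see why MSE fits here: if the decoder is assumed to model p(x|z) as a Gaussian with fixed variance (an assumption on my part, the thread does not state it), then the negative log-likelihood is half the squared error plus a constant, so minimising one minimises the other. A small sketch with made-up numbers:

```python
import math

def gaussian_nll(x, x_hat, sigma=1.0):
    # Negative log-likelihood of x under independent Gaussians centred on x_hat.
    return sum(
        0.5 * ((a - b) / sigma) ** 2 + math.log(sigma) + 0.5 * math.log(2 * math.pi)
        for a, b in zip(x, x_hat)
    )

def squared_error(x, x_hat):
    # Sum of squared differences (MSE times the number of points).
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

# Illustrative values only: a few points of a sine curve and a rough reconstruction.
x = [math.sin(t) for t in (0.0, 0.5, 1.0, 1.5)]
x_hat = [0.1, 0.4, 0.9, 1.0]

# With sigma = 1 the two differ only by a constant that does not depend on x_hat.
constant = gaussian_nll(x, x_hat) - 0.5 * squared_error(x, x_hat)
```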

How am I getting 92% accuracy after initialising parameters with zeros in a simple one layer neural network?

This is from one of the tensorflow examples mnist_softmax.py.
Even though the gradients are non-zero, they must be identical and all the ten weight vectors corresponding to the ten classes should be exactly same and produce the same output logits and hence same probabilities. The only case I could think this is possible is while calculating the accuracy using tf.argmax(), whose output is ambiguous in case of ties, we are getting lucky and resulting in 92% accuracy. But then I checked the values of y after training is complete and they give perfectly different outputs indicating the weight vectors of all classes are not same. Can someone explain how this is possible?
Although it is best to initialize the parameters to small random numbers to break symmetry and possibly accelerate learning, initializing the weights to zeros does not necessarily mean you will get the same probabilities for all classes.
The reason is that the cross_entropy function is a function of the weights, the inputs, and the correct class labels. So the gradient will be different for each output 'neuron', depending on the correct class label, and this will break the symmetry.
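A tiny pure-Python sketch of that argument, with a made-up 3-class, 2-feature example rather than MNIST. With all-zero weights every class gets probability 1/3, but the gradient for class k is (p_k - y_k) * x, and y_k is 1 only for the correct class:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def grad_wrt_weights(W, x, label):
    # W holds one weight vector per class; returns dL/dW for softmax cross-entropy.
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in W]
    p = softmax(logits)
    return [
        [(p[k] - (1.0 if k == label else 0.0)) * x_i for x_i in x]
        for k in range(len(W))
    ]

W = [[0.0, 0.0] for _ in range(3)]   # zero initialisation
x = [1.0, 2.0]                       # one illustrative input
grad = grad_wrt_weights(W, x, label=0)
# The row for the correct class (0) points the opposite way from the other rows,
# so after one update the weight vectors are no longer identical.
```

Within a single example the incorrect classes still share the same gradient; they separate as soon as the batch contains examples of different classes.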

xgboost using the auc metric correctly

I have a slightly imbalanced dataset for a binary classification problem, with a positive to negative ratio of 0.6.
I recently learned about the auc metric from this answer: https://stats.stackexchange.com/a/132832/128229, and decided to use it.
But I came across another link, http://fastml.com/what-you-wanted-to-know-about-auc/, which claims that AUC-ROC is insensitive to class imbalance and that we should use the AUC of a precision-recall curve instead.
The xgboost docs are not clear on which AUC they use, do they use AUC-ROC?
Also the link mentions that AUC should only be used if you do not care about the probability and only care about the ranking.
However, since I am using a binary:logistic objective, I think I should care about probabilities, because I have to set a threshold for my predictions.
The xgboost parameter tuning guide https://github.com/dmlc/xgboost/blob/master/doc/how_to/param_tuning.md
also suggests an alternate method to handle class imbalance, by not balancing positive and negative samples and using max_delta_step = 1.
So can someone explain when AUC is preferred over the other method for handling class imbalance in xgboost? And if I am using AUC, what threshold do I need to set for prediction, or more generally, how exactly should I use AUC to handle an imbalanced binary classification problem in xgboost?
EDIT:
I also need to eliminate false positives more than false negatives. How can I achieve that with the binary:logistic objective, apart from simply varying the threshold?
According to the parameters section of the xgboost docs, there are both auc and aucpr, where pr stands for precision-recall.
I would say you could build some intuition by running both approaches and seeing how the metrics behave. You can include multiple metrics and even optimize with respect to whichever you prefer.
You can also monitor the false positive rate in each boosting round by creating a custom metric.
XGBoost chose to write AUC (area under the ROC curve), but some prefer to be more explicit and say AUC-ROC / ROC-AUC.
https://xgboost.readthedocs.io/en/latest/parameter.html
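On the point that AUC only cares about ranking: the sketch below computes ROC AUC directly as the fraction of (positive, negative) pairs ranked correctly. It is a plain-Python illustration with made-up labels and scores, not xgboost's implementation, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(labels, scores):
    # Fraction of (positive, negative) pairs where the positive scores higher;
    # ties count as half.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]

auc_raw = roc_auc(labels, scores)
# Cubing the scores changes every predicted "probability" but not their order.
auc_squashed = roc_auc(labels, [s ** 3 for s in scores])
```

Both values are identical, which is why AUC says nothing about where to put the decision threshold: that still has to be chosen separately, for example to trade false positives against false negatives.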

Multilabel image classification with sparse labels in TensorFlow?

I want to perform a multilabel image classification task for n classes.
I've got sparse label vectors for each image and each dimension of each label vector is currently encoded in this way:
1.0 ->Label true / Image belongs to this class
-1.0 ->Label false / Image does not belong to this class.
0.0 ->missing value/label
E.g.: V= {1.0,-1.0,1.0, 0.0}
For this example V, the model should learn that the corresponding image should be classified into the first and third classes.
My problem is currently how to handle the missing values/labels. I've searched through the issues and found this one:
tensorflow/skflow#113
So I could do multilabel image classification with:
tf.nn.sigmoid_cross_entropy_with_logits(logits, targets, name=None)
but TensorFlow has this error function for sparse softmax, which is used for exclusive classification:
tf.nn.sparse_softmax_cross_entropy_with_logits(logits, labels, name=None)
So is there something like a sparse sigmoid cross-entropy? (I couldn't find one.) Or are there any suggestions for how I can handle my multilabel classification problem with sparse labels?
I used weighted_cross_entropy_with_logits as the loss function, with positive weights for the 1s.
In my case, all the labels are equally important. But 0 was ten times more likely than 1 to appear as the value of any label.
So I weighted all the 1s by setting the pos_weight parameter of the aforementioned loss function. I used a pos_weight (= weight on positive values) of 10. By the way, I do not recommend any particular strategy for calculating pos_weight. I think it depends on the data at hand.
if real label = 1,
weighted_cross_entropy = pos_weight * sigmoid_cross_entropy
Weighted cross entropy with logits is the same as sigmoid cross entropy with logits, except for the extra weight that multiplies the loss of every target with a positive real value, i.e. 1.
Theoretically, it should do the job. I am still tuning other parameters to optimize the performance. Will update with performance statistics later.
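For reference, the relationship described above can be written out in a few lines of plain Python. This is a sketch of the formula, not the TensorFlow source; the function names are only chosen to resemble the TensorFlow ones.

```python
import math

def softplus(v):
    # log(1 + exp(v)), computed in a numerically stable way
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def sigmoid_cross_entropy(logit, label):
    # label * -log(sigmoid(logit)) + (1 - label) * -log(1 - sigmoid(logit))
    return label * softplus(-logit) + (1.0 - label) * softplus(logit)

def weighted_cross_entropy(logit, label, pos_weight):
    # Same as above, but the positive-label term is scaled by pos_weight.
    return pos_weight * label * softplus(-logit) + (1.0 - label) * softplus(logit)

loss_pos = weighted_cross_entropy(0.3, 1.0, pos_weight=10.0)   # 10x the unweighted loss
loss_neg = weighted_cross_entropy(0.3, 0.0, pos_weight=10.0)   # unchanged for a 0 label
```

With pos_weight = 1 the two functions agree for every label.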
First, I would like to know what you mean by missing data. What is the difference between missing and false in your case?
Next, I think it is wrong to represent your data like this. You are trying to represent unrelated information on the same dimension. (If it were only false or true, it would work.)
What seems better to me is to represent, for each of your classes, a probability that it is true, missing, or false.
In your case V = [(1,0,0),(0,0,1),(1,0,0),(0,1,0)]
Ok!
So I think your problem is more about how to handle the missing data.
In that case you should definitely use tf.nn.sigmoid_cross_entropy_with_logits().
Just change the target for the missing data to 0.5 (0 for false and 1 for true).
I have never tried this approach, but it should let your network learn without biasing it too much.
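A minimal sketch of that suggestion in plain Python, using the example vector V from the question. The helper names are hypothetical; the gradient expression is the standard derivative of sigmoid cross-entropy with respect to the logit.

```python
import math

def to_targets(v):
    # Map the question's encoding to sigmoid targets:
    # 1.0 (true) -> 1.0, -1.0 (false) -> 0.0, 0.0 (missing) -> 0.5
    mapping = {1.0: 1.0, -1.0: 0.0, 0.0: 0.5}
    return [mapping[e] for e in v]

def loss_gradient(logit, target):
    # d/d(logit) of sigmoid cross-entropy = sigmoid(logit) - target
    return 1.0 / (1.0 + math.exp(-logit)) - target

targets = to_targets([1.0, -1.0, 1.0, 0.0])
# For a missing label the gradient vanishes at logit 0 (prediction 0.5),
# so the loss pulls that output towards "undecided" rather than true or false.
grad_at_zero = loss_gradient(0.0, 0.5)
```

Note that a 0.5 target still produces a gradient whenever the prediction moves away from 0.5, so missing labels are not ignored entirely under this scheme.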

Unaggregated gradients / gradients per example in tensorflow

Given a simple mini-batch gradient descent problem on MNIST in TensorFlow (like in this tutorial), how can I retrieve the gradients for each example in the batch individually?
tf.gradients() seems to return gradients averaged over all examples in the batch. Is there a way to retrieve gradients before aggregation?
Edit: A first step towards this answer is figuring out at which point tensorflow averages the gradients over the examples in the batch. I thought this happened in _AggregatedGrads, but that doesn't appear to be the case. Any ideas?
tf.gradients returns the gradient of the loss with respect to the variables. This means that if your loss is a sum of per-example losses, then the gradient is also the sum of the per-example loss gradients.
The summing up is implicit. For instance, if you want to minimize the sum of squared norms of the errors Wx-y, the gradient with respect to W is 2(WX-Y)X', where X is the batch of observations and Y is the batch of labels. You never explicitly form "per-example" gradients that you later sum up, so it's not a simple matter of removing some stage in the gradient pipeline.
A simple way to get k per-example loss gradients is to use batches of size 1 and do k passes. Ian Goodfellow wrote up how to get all k gradients in a single pass; for this you would need to specify the gradients explicitly rather than rely on the tf.gradients method.
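The squared-error case can be checked by hand. The sketch below is plain Python with made-up numbers and a weight vector instead of a matrix (an assumption made to keep it short): it forms the per-example gradients 2(w·x - y)x explicitly, sums them, and compares the result with a numerical gradient of the summed loss.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def total_loss(w, xs, ys):
    # Sum over the batch of squared errors (w.x - y)^2
    return sum((dot(w, x) - y) ** 2 for x, y in zip(xs, ys))

def per_example_grads(w, xs, ys):
    # One gradient vector per example: 2 * (w.x - y) * x
    return [[2.0 * (dot(w, x) - y) * xi for xi in x] for x, y in zip(xs, ys)]

def numerical_grad(w, xs, ys, eps=1e-6):
    # Central differences on the summed loss, one coordinate at a time
    grad = []
    for j in range(len(w)):
        up = list(w); up[j] += eps
        down = list(w); down[j] -= eps
        grad.append((total_loss(up, xs, ys) - total_loss(down, xs, ys)) / (2 * eps))
    return grad

w = [0.5, -1.0]
xs = [[1.0, 2.0], [3.0, -1.0], [0.0, 1.0]]
ys = [1.0, 0.0, 2.0]

grads = per_example_grads(w, xs, ys)           # what a size-1 batch would give, k times
summed = [sum(col) for col in zip(*grads)]     # what the batch gradient returns
numeric = numerical_grad(w, xs, ys)
```

Here the per-example gradients are [-5, -10], [15, -5] and [0, -6], and only their sum [10, -21] is visible once the loss has been summed over the batch.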
To partly answer my own question after tinkering with this for a while: it appears to be possible to manipulate gradients per example while still working in batches, by doing the following:
Create a copy of tf.gradients() that accepts an extra tensor/placeholder with example-specific factors
Create a copy of _AggregatedGrads() and add a custom aggregation method that uses the example-specific factors
Call your custom tf.gradients function and give your loss as a list of slices:
custagg_gradients(
    ys=[cross_entropy[i] for i in range(batch_size)],
    xs=variables.trainable_variables(),
    aggregation_method=CUSTOM,
    gradient_factors=gradient_factors
)
But this will probably have the same complexity as doing individual passes per example, and I need to check if the gradients are correct :-).
One way of retrieving gradients before aggregation is to use the grad_ys parameter of tf.gradients. A good discussion is found here:
Use of grads_ys parameter in tf.gradients - TensorFlow
EDIT:
I haven't been working with TensorFlow a lot lately, but here is an open issue tracking the best way to compute unaggregated gradients:
https://github.com/tensorflow/tensorflow/issues/675
Users (including myself) have posted many sample code solutions there that you can try, depending on your needs.