I want to train an HMM classifier with features as input. Consider two observation states (o1, o2), two hidden states (h1, h2), and some initial probabilities. I apply a supervised algorithm and, based on the classifier output, calculate the following:
Transition probabilities:
[ P(h1|h1), P(h1|h2);
  P(h2|h1), P(h2|h2) ]
Emission probabilities:
[ P(o1|h1), P(o1|h2);
  P(o2|h1), P(o2|h2) ]
Is this the correct way to calculate the probabilities?
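For reference, here is a minimal NumPy sketch (the variable names and toy sequences are hypothetical) of the supervised counting that produces such matrices: count state-to-state transitions and state-to-observation pairs in the labeled data, then row-normalize. Note that the rows here index the previous/current hidden state, i.e. the transpose of the layout written above.

import numpy as np

# Hypothetical labeled training data: parallel sequences of hidden states
# (0 = h1, 1 = h2) and observations (0 = o1, 1 = o2).
hidden = [0, 0, 1, 1, 0, 1]
obs    = [0, 1, 1, 0, 0, 1]

trans = np.zeros((2, 2))            # trans[i, j] ~ P(h_j | h_i)
emis  = np.zeros((2, 2))            # emis[i, k]  ~ P(o_k | h_i)

for t in range(len(hidden) - 1):
    trans[hidden[t], hidden[t + 1]] += 1
for t in range(len(hidden)):
    emis[hidden[t], obs[t]] += 1

trans /= trans.sum(axis=1, keepdims=True)   # each row becomes a conditional distribution
emis  /= emis.sum(axis=1, keepdims=True)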
I am trying to get a grasp on pre-training BERT models. I am using the TFBertForPreTraining model.
According to the original paper, BERT is pre-trained on next sentence prediction and masked language modeling simultaneously. Hence I pick pairs of sequences where 50% of them follow one another and 50% do not. Additionally, I mask 15% of the input tokens and remember the masked tokens as labels. Now when passing the inputs through the model I get prediction_logits and seq_relationship_logits as outputs. The prediction_logits are of shape (batch_size, max_length, vocab_size). As far as I understand, each slice (i, j, :) gives the (unnormalized) distribution over the vocabulary for the j-th token of the i-th input sequence.
Now the original paper states that the loss is
the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood
Finally, my question is whether the mean for the LM loss is taken over the whole sequence, only over the masked tokens, or over the whole sequence but excluding the padding tokens.
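For what it's worth, under the usual Hugging Face convention (an assumption here; the question does not say how the labels are encoded) non-masked positions carry the label -100 and are excluded, so the LM mean is effectively taken over the masked positions only. A minimal sketch of combining the two terms under that assumption:

import tensorflow as tf

# Sketch only, not the exact Hugging Face implementation. Assumes the usual HF
# convention that non-masked positions carry the label -100 and are ignored.
def pretraining_loss(prediction_logits, seq_relationship_logits, mlm_labels, nsp_labels):
    scce = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
    mask = tf.cast(mlm_labels != -100, tf.float32)                     # 1.0 at masked positions
    safe_labels = tf.where(mlm_labels == -100, tf.zeros_like(mlm_labels), mlm_labels)
    per_token = scce(safe_labels, prediction_logits)                   # (batch, seq_len)
    mlm_loss = tf.reduce_sum(per_token * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)
    nsp_loss = tf.reduce_mean(scce(nsp_labels, seq_relationship_logits))
    return mlm_loss + nsp_loss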
I'm evaluating the possibility of introducing a new loss for the problem described above.
Let N be the number of examples, C the number of classes, p_{ij} the classifier output for class j on example i, and y_{ij} the binary indicator (0 or 1) of whether class label j is the correct classification for example i. The loss would use the cross-entropy, with the per-class c.e. losses correlated by a positive semi-definite similarity matrix S.
Is there a way to write this custom loss in a deep learning framework (e.g. TensorFlow) and have backpropagation based on it?
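As long as the loss is built from differentiable TensorFlow ops, backpropagation through it comes for free from automatic differentiation. Below is a hedged sketch; the quadratic coupling ce^T S ce is only an assumed form for how the per-class c.e. terms could be correlated by S:

import tensorflow as tf

def correlated_ce_loss(y_true, y_pred, S, eps=1e-7):
    # y_true: (batch, C) one-hot labels, y_pred: (batch, C) predicted probabilities,
    # S: (C, C) positive semi-definite similarity matrix.
    ce = -y_true * tf.math.log(y_pred + eps)          # per-example, per-class c.e. terms
    per_class = tf.reduce_mean(ce, axis=0)            # mean c.e. per class, shape (C,)
    # Assumed coupling: quadratic form per_class^T S per_class
    return tf.tensordot(per_class, tf.linalg.matvec(S, per_class), axes=1)

Because every op above is differentiable, tf.GradientTape (or a custom loss passed to model.compile) will propagate gradients through the S-coupled terms without any extra work.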
While the feed-forward --> backpropagation cycle advances, new patterns are processed by the neural network, and so new errors are determined:
where x is the number of feed-forward steps.
So it should be possible to backpropagate using the loss:
where x is the number of feed-forward steps.
In case you can help me, we can be co-authors of a paper about this subject.
TL;DR: you can just skip to the question below.
Suppose I have an encoder-decoder neural network, with weights W_1 and W_2 for the encoder and decoder respectively. Let's denote by Z the output of the encoder. The network is trained with batch size n, and all the gradients are calculated with respect to the batch loss L_hat, the mean of the per-sample losses L.
What I'm trying to achieve is, in the backward pass, to manipulate the gradient of Z before propagating it further to the encoder's weights W_1. Suppose there is a modified-gradient operator for which the following holds:
The procedure described above, in the case of a synchronous pass (first calculate the modified gradient of Z, then propagate it down to W_1), is very easy to implement (the Jacobian multiplication is done using the grad_ys argument of tf.gradients):
import tensorflow as tf

def modify_grad(grad_z):
    # do some modifications to the upstream gradient dL_hat/dZ (identity placeholder here)
    return grad_z

grad_z = tf.gradients(L_hat, Z)[0]                      # gradient of the batch loss w.r.t. Z
mod_grad_z = modify_grad(grad_z)
mod_grad_w1 = tf.gradients(Z, W_1, grad_ys=mod_grad_z)  # chain the modified gradient down to W_1
The problem is that I need to accumulate the gradient grad_z of the tensor Z over several batches. As its shape is dynamic (with None in one of the dimensions), I cannot define a tf.Variable to store it. Furthermore, the batch size n may change during training. How can I store the average of grad_z over several batches?
PS: I just wanted to combine the Pareto-optimal training of arXiv:1810.04650, the asynchronous network training of arXiv:1609.02132, and the batch-size scheduling of arXiv:1711.00489.
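One hedged workaround (an assumption on my part, not necessarily the best option): rather than storing grad_z itself, whose batch dimension is dynamic, accumulate the already-chained gradient with respect to W_1, whose shape is static, in non-trainable variables and average it over several batches:

import tensorflow as tf

# TF1-style sketch, reusing mod_grad_w1 from the snippet above.
accum_grad_w1 = tf.Variable(tf.zeros(W_1.shape), trainable=False)
num_batches = tf.Variable(0.0, trainable=False)

accumulate_op = tf.group(
    accum_grad_w1.assign_add(mod_grad_w1[0]),        # tf.gradients returns a list
    num_batches.assign_add(1.0))

avg_grad_w1 = accum_grad_w1 / tf.maximum(num_batches, 1.0)   # mean over accumulated batches

reset_op = tf.group(
    accum_grad_w1.assign(tf.zeros(W_1.shape)),
    num_batches.assign(0.0))

Running accumulate_op after each batch, using avg_grad_w1 in the weight update, and then running reset_op sidesteps the dynamically shaped grad_z entirely, and it keeps working if the batch size n changes.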
In deep learning implementations related to object detection and semantic segmentation, I have seen the output layers use either sigmoid or softmax. I am not very clear on when to use which; it seems to me both of them can support these tasks. Are there any guidelines for this choice?
softmax() helps when you want a probability distribution that sums to 1. sigmoid is used when you want each output to range from 0 to 1, but the outputs need not sum to 1.
In your case, you wish to classify and choose between two alternatives. I would recommend using softmax(), as you will get a probability distribution to which you can apply the cross-entropy loss function.
The sigmoid and the softmax function have different purposes. For a detailed explanation of when to use sigmoid vs. softmax in neural network design, you can look at this article: "Classification: Sigmoid vs. Softmax."
Short summary:
If you have a multi-label classification problem where there is more than one "right answer" (the outputs are NOT mutually exclusive) then you can use a sigmoid function on each raw output independently. The sigmoid will allow you to have high probability for all of your classes, some of them, or none of them.
If you instead have a multi-class classification problem where there is only one "right answer" (the outputs are mutually exclusive), then use a softmax function. The softmax will enforce that the sum of the probabilities of your output classes is equal to one, so in order to increase the probability of a particular class, your model must correspondingly decrease the probability of at least one of the other classes.
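To make the distinction concrete, here is a small Keras sketch (the layer sizes and names are illustrative, not taken from the answer): a multi-label head uses independent sigmoids with binary cross-entropy, while a mutually exclusive multi-class head uses softmax with categorical cross-entropy.

import tensorflow as tf

num_classes = 10          # illustrative
features = tf.keras.Input(shape=(128,))

# Multi-label: several classes can be "on" at once -> independent sigmoids + binary CE
multi_label_head = tf.keras.layers.Dense(num_classes, activation="sigmoid")(features)
multi_label_model = tf.keras.Model(features, multi_label_head)
multi_label_model.compile(optimizer="adam", loss="binary_crossentropy")

# Multi-class: exactly one correct class -> softmax + categorical CE
multi_class_head = tf.keras.layers.Dense(num_classes, activation="softmax")(features)
multi_class_model = tf.keras.Model(features, multi_class_head)
multi_class_model.compile(optimizer="adam", loss="categorical_crossentropy")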
Object detection is object classification applied to a sliding window over the image. In classification it is important to find the correct output in some class space. E.g. you detect 10 different objects and you want to know which object is the most likely one. Then softmax is good because of its property that the outputs of the whole layer sum up to 1.
Semantic segmentation, on the other hand, segments the image in some way. I have done semantic medical segmentation, where the output is a binary image. This means you can use a sigmoid output to predict whether each pixel belongs to that specific class, because sigmoid values are between 0 and 1 for each output.
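As an illustration of that binary-segmentation setup (the input shape and the 0.5 threshold are assumptions, not from the answer), a per-pixel sigmoid output can simply be thresholded to obtain the mask:

import tensorflow as tf

# Illustrative binary segmentation head: one sigmoid unit per pixel.
images = tf.keras.Input(shape=(256, 256, 3))
x = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(images)
mask_prob = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(x)   # (H, W, 1) values in [0, 1]
model = tf.keras.Model(images, mask_prob)
model.compile(optimizer="adam", loss="binary_crossentropy")

# At inference, threshold the per-pixel probabilities to obtain the binary mask:
# predicted_mask = tf.cast(model(batch) > 0.5, tf.uint8)   # hypothetical usage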
In general, softmax is used (as in the softmax classifier) when there are n classes. Either sigmoid or softmax can be used for binary (n = 2) classification.
Sigmoid:
S(x) = 1 / (1 + e^(-x))
Softmax:
σ(z)_j = e^(z_j) / Σ_{k=1..K} e^(z_k), for j = 1, ..., K
Softmax is a kind of multi-class sigmoid, but as the formula shows, the softmax units are supposed to sum to 1. With sigmoid that is not necessarily the case.
Digging deeper, you can also use sigmoid for multi-class classification. When you use a softmax, you basically get a probability for each class (a joint distribution and a multinomial likelihood) whose sum is bound to be one. If you use sigmoid for multi-class classification instead, it is like a marginal distribution with a Bernoulli likelihood: p(y0|x), p(y1|x), etc.
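A quick numeric check of the sum-to-1 property (the logits below are arbitrary example values):

import numpy as np

z = np.array([2.0, 1.0, 0.1])                 # arbitrary example logits

softmax = np.exp(z) / np.exp(z).sum()         # ~ [0.659, 0.242, 0.099], sums to 1
sigmoid = 1.0 / (1.0 + np.exp(-z))            # ~ [0.881, 0.731, 0.525], sums to about 2.14

print(softmax, softmax.sum())                 # the softmax outputs always sum to 1
print(sigmoid, sigmoid.sum())                 # the sigmoids are independent, no such constraint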
I have read the docs of both functions, but as far as I know, for the function tf.nn.softmax_cross_entropy_with_logits(logits, labels, dim=-1, name=None) the result is the cross-entropy loss, and the dimensions of logits and labels are the same.
But for the function tf.nn.sparse_softmax_cross_entropy_with_logits, the dimensions of logits and labels are not the same?
Could you give a more detailed example of tf.nn.sparse_softmax_cross_entropy_with_logits?
The difference is in the label format. tf.nn.softmax_cross_entropy_with_logits expects labels with the same shape as the logits: one probability distribution over the classes per example (typically a one-hot vector, though soft class probabilities are allowed as long as each row of labels is a valid distribution).
Compare with sparse_*:
Measures the probability error in discrete classification tasks in
which the classes are mutually exclusive (each entry is in exactly one
class). For example, each CIFAR-10 image is labeled with one and only
one label: an image can be a dog or a truck, but not both.
As such, with the sparse function, the dimensions of logits and labels are not the same: labels contain one class index per example, whereas logits contain one value per class per example (unnormalized scores).
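A minimal worked example (the numbers are arbitrary) showing that the two functions agree when the dense labels are the one-hot encoding of the sparse class indices:

import tensorflow as tf

logits = tf.constant([[2.0, 0.5, -1.0],
                      [0.1, 1.5,  0.3]])            # (batch=2, classes=3), arbitrary scores

sparse_labels = tf.constant([0, 1])                 # one class index per example
dense_labels  = tf.one_hot(sparse_labels, depth=3)  # (2, 3), same shape as logits

loss_sparse = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=sparse_labels, logits=logits)
loss_dense  = tf.nn.softmax_cross_entropy_with_logits(labels=dense_labels, logits=logits)

# Both give the same per-example losses here, since the one-hot labels encode
# exactly the same information as the sparse class indices.
print(loss_sparse.numpy(), loss_dense.numpy())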