TFBertForPreTraining prediction logits in masked LM - TensorFlow

I am trying to get a grasp on pre-training BERT models. I am using the TFBertForPreTraining model.
According to the original paper, BERT is pre-trained on next sentence prediction and masked language modeling simultaneously. Hence I pick pairs of sequences where 50% of them follow one another and 50% do not. Additionally, I mask 15% of the input tokens and remember the masked tokens as labels. When passing the inputs through the model I get prediction_logits and seq_relationship_logits as outputs. The prediction_logits are of shape (batch_size, max_length, vocab_size). As far as I understand, each (i, j, :) holds the scores over the vocabulary (a probability distribution after a softmax) for the j-th token of the i-th input sequence.
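For illustration, a minimal sketch of the shapes involved (a hedged example assuming the HuggingFace transformers API in a recent version that returns output objects; bert-base-uncased is just an example checkpoint):

from transformers import BertTokenizer, TFBertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForPreTraining.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the [MASK].", return_tensors="tf")
outputs = model(inputs)

# (batch_size, max_length, vocab_size): per-token scores over the vocabulary
print(outputs.prediction_logits.shape)
# (batch_size, 2): is-next / not-next scores for the sentence pair
print(outputs.seq_relationship_logits.shape)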
Now the original paper states that the loss is "the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood".
Finally, my question is whether the mean for the masked LM loss is taken over the whole sequence, only over the masked tokens, or over the whole sequence excluding the padding tokens.
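To make the question concrete, here is a sketch of my current guess (mean over the masked positions only, ignoring padding); the -100 convention for non-masked labels is borrowed from the HuggingFace examples and is my own choice here:

import tensorflow as tf

# prediction_logits: (batch, max_length, vocab_size)
# lm_label_ids:      (batch, max_length), the original token id at masked
#                    positions and -100 everywhere else (padding included)
def masked_lm_loss(prediction_logits, lm_label_ids):
    mask = tf.cast(tf.not_equal(lm_label_ids, -100), tf.float32)
    safe_labels = tf.where(lm_label_ids == -100,
                           tf.zeros_like(lm_label_ids), lm_label_ids)
    per_token = tf.keras.losses.sparse_categorical_crossentropy(
        safe_labels, prediction_logits, from_logits=True)
    # mean over the masked tokens only
    return tf.reduce_sum(per_token * mask) / tf.reduce_sum(mask)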

Related

Keras variable input

I'm working through a Keras example at https://www.tensorflow.org/tutorials/text/text_generation
The model is built here:
import tensorflow as tf

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model
During training, they always pass in a length-100 array of ints.
But during prediction, they are able to pass in input of any length, and the output has the same length as the input. I was always under the impression that the number of time steps had to be the same. Is that not the case? Can the number of time steps of the RNN somehow change?
RNNs are sequence models, i.e. they take in a sequence of inputs and give out a sequence of outputs. The sequence length, also called the number of time steps, is the number of times the RNN cell is unrolled; for each unrolling an input is passed in, and the RNN cell, using its gates, gives out an output (one per unrolling). So in theory you can have as long a sequence as you want. Now let's assume you have inputs of different sizes; since you cannot have variable-size inputs in a single batch, you have to collect inputs of the same size to make a batch if you want to train using batches. You can as well use a batch size of 1 and not worry about all this, but training becomes painfully slow.
In practical situations, while training we divide the inputs into same-size groups so that training becomes fast. There are situations, like language translation models, where this is not feasible.
So in theory RNNs do not have any limitation on the sequence length; however, with long sequences the model will start to lose the context at the beginning as the sequence length increases.
For predictions you can use any sequence length you want.
In your case, your output size is the same as your input size because of return_sequences=True. You can as well have a single output by using return_sequences=False, in which case only the output of the last unrolling is returned by Keras.
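A quick sketch using the build_model function from the question (sizes are arbitrary): since the time dimension is None, the same weights accept any sequence length, and with return_sequences=True the output length tracks the input length.

model = build_model(vocab_size=65, embedding_dim=256, rnn_units=1024, batch_size=1)
for seq_len in (10, 50, 200):
    x = tf.random.uniform([1, seq_len], maxval=65, dtype=tf.int32)
    print(model(x).shape)  # (1, seq_len, 65): one output per input step
    model.reset_states()   # clear the stateful GRU state between examples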
The length of the training sequences does not have to equal the prediction length.
An RNN deals with two vectors: the new input and the hidden state (accumulated from the previous steps). It doesn't keep track of the sequence length.
But to get good predictions on long sequences, you have to train the RNN with long sequences, because the RNN needs to learn a long context.

Keras Masking layer for LSTM input to mask features instead of timesteps

I gather that Masking layers in Keras are commonly used for handling data inputs with varying timesteps. Based on the documentation, I understand that if all of the features for a given timestep equal the mask value, then that timestep will be skipped in downstream layers.
For my problem, I am instead interested in using masking for features, where the data input shape to the network is (batch_size, num_timesteps, num_features). Essentially, I want to be able to predict a timeseries one step into the future with num_features features, but assuming that I won't always have all the features from the previous timestep to base my prediction on.
For example, one could predict RGB values one timestep into the future for a pixel in a video stream based on partial data from a previous timestep. At every timestep the output should be all of RGB, but at some timesteps you may get only RG, or only RB, or only BG, and you never know which partial data you'll have at each timestep to make your prediction. This is why I want to somehow be able to indicate a feature as masked during training to accommodate this kind of prediction.
It may be that Masking in Keras is not the correct mechanism to achieve this. What is the correct type of network layer that would give me this behavior?
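For illustration, one workaround (not Keras's Masking, which operates per time step) is to feed a binary availability indicator alongside each feature and zero out the missing values; a sketch with made-up shapes:

import numpy as np

# x: (batch, timesteps, features), e.g. R, G, B values in [0, 1]
x = np.random.rand(32, 10, 3).astype("float32")
# 1.0 where a feature was observed, 0.0 where it is missing (random here)
available = (np.random.rand(32, 10, 3) > 0.3).astype("float32")

x_masked = x * available                                      # zero out missing values
model_input = np.concatenate([x_masked, available], axis=-1)  # (32, 10, 6)
# The network sees both the (zeroed) values and which ones were present,
# so it can learn to discount the missing features.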

What is the equivalent of the Keras Masking() function in TensorFlow? And do batch norm, conv, and ReLU support masking?

I am training a GRU layer where the inputs don't have the same length. Therefore, I have padded the inputs' features with 0.0 to make all sequences the same length. On the other hand, I don't want to compute any loss at any time step, for any sample, as long as the input feature vector is all zeros. For example, at time step 1000 I have a batch size of 34, but samples 33 and 34 of this batch lack data or feature values at time step 1000.
I have found that we can use the method Masking()(inputs) in Keras as long as all subsequent layers or operations support masking. But I have implemented my model in TensorFlow. So what is the equivalent of Masking() in TensorFlow?
Second, how can I know whether batch normalization, conv layers, and any non-linear activation function support the masking produced by the Masking() function in Keras?
Your help is much appreciated!!
So I found the detailed solution in danijar's blog: https://danijar.com/variable-sequence-lengths-in-tensorflow/.
Masking in Keras is used when you have incomplete sequences. So usually you need to pad your sequences with 0.0 in the third dimension (the features dimension, when the input has shape [batch_size, sequence_length, num_features]). The Keras Masking layer then takes that mask value and outputs 0 for the activations of the padded time steps.
In summary: he shows how to compute the sequence length for each sample in the batch using a length() function he implements. The resulting length vector is then fed into dynamic_rnn, which outputs zero vectors for the padded steps of incomplete sequences (for both states and outputs), somewhat similar to what the Keras Masking() function does. Second, a mask should be used when computing the loss function.
All the details are discussed in this blog post.
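For reference, a sketch of the length trick described in that post (assuming the inputs are padded with all-zero feature vectors):

import tensorflow as tf

def length(inputs):
    # inputs: [batch_size, max_steps, num_features], zero-padded
    used = tf.sign(tf.reduce_max(tf.abs(inputs), axis=2))   # 1.0 at real steps
    return tf.cast(tf.reduce_sum(used, axis=1), tf.int32)   # true length per sample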
But regarding masking support in batch norm, conv, and non-linear activation functions: usually, if the output of the LSTM is zero, then with a sigmoid activation function at the output, the derivative of the output with respect to the input of the sigmoid is output * (1 - output). Hence, when the output is 0, this derivative is zero as well. And since backpropagation applies the chain rule, the gradients of the current sample with respect to any weight parameter in the network are going to be 0 as well. Hence, there is no need to worry about support... But the problem arises when the activation is, for example, ReLU: this is when the gradients should be explicitly multiplied by zeros before doing the backpropagation (I guess). Maybe doing something like this will help:
final_output = output * mask
Then the derivative of final_output with respect to output will be the mask => 0 or 1 (at any time step, for any sample). Then backpropagate this gradient from the output of the activation function to its inputs, followed by the chain rule => the weights won't be affected in this case.
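Putting the loss part together, a hedged sketch (cross-entropy is just an example objective; mask is 1.0 at real time steps and 0.0 at padded ones):

import tensorflow as tf

def masked_loss(labels, logits, mask):
    # labels: [batch, steps] int ids; logits: [batch, steps, num_classes]
    per_step = tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)              # [batch, steps]
    # padded steps contribute nothing; average over real steps only
    return tf.reduce_sum(per_step * mask) / tf.reduce_sum(mask)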

What does the size of the GRU or LSTM cell in the TensorFlow seq2seq tutorial represent?

I'm working with the seq2seq model in the TensorFlow tutorials, and I'm having trouble understanding some of the details. One thing that confuses me is what the "size" of a cell represents. I think I have a high-level understanding of the usual encoder-decoder diagrams: I believe they show that the output from the last step in the encoder is the input to the first step in the decoder. In this case each box is the GRU or LSTM cell at a different time step in the sequence.
I also think I understand, at a superficial level, the diagrams from colah's blog post about LSTM and GRU cells. My understanding is that a "cell" is a neural network that feeds the output from one step back into itself along with the new input for the subsequent step. The gates control how much it "remembers" and "forgets."
I think I am getting confused at the level between this superficial, high level understanding and the low-level details. It sounds like the "size" of a cell is the number of nodes in the sigmoid and tanh boxes. Is that correct? If so, how does that relate to the input size for the seq2seq model? For example, the default vocabulary size is 40,000, and the default cell size is 1024. How does the 40,000 element one-hot vocabulary vector for each step of the sequence get matched to the 1024 node internal cell size? Is that what the embedding wrapper does?
Most importantly, what effect would increasing or decreasing the size of the cell have? Would a larger cell be better at learning embeddings? Or at predicting outputs? Both?
It sounds like the "size" of a cell is the number of nodes in the sigmoid and tanh boxes. Is that correct?
The size of the cell is the size of the RNN state vector h. In the case of LSTM it's also the size of c. It's not "the number of nodes" (I'm not sure what you mean by nodes).
If so, how does that relate to the input size for the seq2seq model? For example, the default vocabulary size is 40,000, and the default cell size is 1024. How does the 40,000 element one-hot vocabulary vector for each step of the sequence get matched to the 1024 node internal cell size?
The input size for the model is independent of the state size. The two vectors (input and state) are concatenated and multiplied by a matrix of shape [state_size + input_size, state_size] to get the next state (simplified version).
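In code, that simplified update looks something like this (a vanilla-RNN sketch, not the actual GRU/LSTM equations):

import tensorflow as tf

# h: [batch, state_size], x: [batch, input_size]
# W: [state_size + input_size, state_size], b: [state_size]
def rnn_step(h, x, W, b):
    concat = tf.concat([h, x], axis=1)          # [batch, state_size + input_size]
    return tf.tanh(tf.matmul(concat, W) + b)    # next state: [batch, state_size]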
Is that what the embedding wrapper does?
No, the embedding is the result of multiplying the one-hot input vector by a matrix of shape [vocab_size, input_size]; this happens before the state-update multiplication described above.
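A small sketch of that equivalence, using the sizes from the question (the lookup returns the same row as the one-hot multiplication, just more cheaply):

import tensorflow as tf

vocab_size, input_size = 40000, 1024
E = tf.random.normal([vocab_size, input_size])     # embedding matrix

token = tf.constant(7)                             # some vocabulary id
one_hot = tf.one_hot(token, vocab_size)            # [vocab_size]
via_matmul = tf.matmul(one_hot[None, :], E)[0]     # [input_size]
via_lookup = tf.nn.embedding_lookup(E, token)      # the same vector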

patch-wise training and fully convolutional training in FCN

In the FCN paper, the authors discuss patch-wise training and fully convolutional training. What is the difference between the two?
Please refer to section 4.4 of the paper.
It seems to me that the training mechanism is as follows:
Assume the original image is M*M; then iterate over the M*M pixels to extract N*N patches (where N<M). The iteration stride can be some number like N/3 to generate overlapping patches. Moreover, assume each single image corresponds to 20 patches; then we can put these 20 patches, or 60 patches (if we want to use 3 images), into a single mini-batch for training. Is this understanding right? It seems to me that this so-called fully convolutional training is the same as patch-wise training.
The term "Fully Convolutional Training" just means replacing fully-connected layer with convolutional layers so that the whole network contains just convolutional layers (and pooling layers).
The term "Patchwise training" is intended to avoid the redundancies of full image training.
In semantic segmentation, given that you are classifying each pixel in the image, by using the whole image, you are adding a lot of redundancy in the input. A standard approach to avoid this during training segmentation networks is to feed the network with batches of random patches (small image regions surrounding the objects of interest) from the training set instead of full images. This "patchwise sampling" ensures that the input has enough variance and is a valid representation of the training dataset (the mini-batch should have the same distribution as the training set). This technique also helps to converge faster and to balance the classes. In this paper, they claim that is it not necessary to use patch-wise training and if you want to balance the classes you can weight or sample the loss.
From a different perspective, the problem with full-image training in per-pixel segmentation is that the input image has a lot of spatial correlation. To fix this, you can either sample patches from the training set (patchwise training) or sample the loss from the whole image. That is why the subsection is called "Patchwise training is loss sampling".
So by "restricting the loss to a randomly sampled subset of its spatial terms excludes patches from the gradient computation." They tried this "loss sampling" by randomly ignoring cells from the last layer so the loss is not calculated over the whole image.