I have been building a word-based neural machine translation model in Tensorflow using LSTMs. I have been following a couple of tutorials, including:
https://towardsdatascience.com/implementing-neural-machine-translation-using-keras-8312e4844eb8
My question is specifically about how the final Dense layer (with softmax activation) works.
All the words in the corpus are assigned to an integer. No word is assigned to the integer 0.
When you get your output from the final Dense (+ softmax) layer, what happens if index 0 has the maximum value? How does Tensorflow interpret this? No word in the target language has been assigned to the index 0. Yet this output needs to be fed as the input for the next time-step.
Could someone explain what's going on here?
Related
From https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense (*Note Section)
For an input of (batch_size, d0, d1) why is the same (d1, units) kernel used for every sub-tensor (1, 1, d1)?
Additionally, why is the higher dimension dense layer operation broken down to work on subsets of input nodes, instead of having a weight from all d0xd1 inputs to an output node?
I apologize if I am missing something obvious and thank you for any help!
So I dug through the source code and found Tensorflow made a change in how they implement the dense operation and I asked my boss at uni why they made this change.
In tf1 for input > rank 2 they flattened the input and just did a regular 1-D dense operation.
https://github.com/tensorflow/tensorflow/blob/r1.15/tensorflow/python/keras/layers/core.py
In tf2 for input > rank 2 they use the tensordot operation. This uses a smaller kernel and shares it for all input sub-tensors. This has the effect of sharing the learned channel-wise information.
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/layers/ops/core.py
As part of my thesis, I am trying to build a recurrent Neural Network Language Model.
From theory, I know that the input layer should be a one-hot vector layer with a number of neurons equal to the number of words of our Vocabulary, followed by an Embedding layer, which, in Keras, it apparently translates to a single Embedding layer in a Sequential model. I also know that the output layer should also be the size of our vocabulary so that each output value maps 1-1 to each vocabulary word.
However, in both the Keras documentation for the Embedding layer (https://keras.io/layers/embeddings/) and in this article (https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/#comment-533252), the vocabulary size is arbitrarily augmented by one for both the input and the output layers! Jason gives an explenation that this is due to the implementation of the Embedding layer in Keras but that doesn't explain why we would also use +1 neuron in the output layer. I am at the point of wanting to order the possible next words based on their probabilities and I have one probability too many that I do not know to which word to map it too.
Does anyone know what is the correct way of acheiving the desired result? Did Jason just forget to subtrack one from the output layer and the Embedding layer just needs a +1 for implementation reasons (I mean it's stated in the official API)?
Any help on the subject would be appreciated (why is Keras API documentation so laconic?).
Edit:
This post Keras embedding layer masking. Why does input_dim need to be |vocabulary| + 2? made me think that Jason does in fact have it wrong and that the size of the Vocabulary should not be incremented by one when our word indices are: 0, 1, ..., n-1.
However, when using Keras's Tokenizer our word indices are: 1, 2, ..., n. In this case, the correct approach is to:
Set mask_zero=True, to treat 0 differently, as there is never a
0 (integer) index input in the Embedding layer and keep the
vocabulary size the same as the number of vocabulary words (n)?
Set mask_zero=True but augment the vocabulary size by one?
Not set mask_zero=True and keep the vocabulary size the same as the
number of vocabulary words?
the reason why we add +1 leads to the possibility that we can encounter a chance to see an unseen word(out of our vocabulary) during testing or in production, it is common to consider a generic term for those UNKNOWN and that is why we add a OOV word in front which resembles all out of vocabulary words.
Check this issue on github which explains it in detail:
https://github.com/keras-team/keras/issues/3110#issuecomment-345153450
I am training a GRU layer where inputs doesn't have the same length. Therefore, I have padded the inputs' features with 0.0 to make all sequences of same length. On the other hand, I don't want to compute any loss at any time step, for any sample as long as the input feature vector is all zeros. Example, at time step 1000, I have a batch size of 34, but samples number 33 and 34 of this batch lack data or feature values at time step 1000.
I have found that we can use the method Masking()(inputs) in Keras as long as all subsequent layers or operations support masking. But I have implemented my model in tensorflow. So what is the equivalence of Masking() in tensorflow?
Second, how can I know whether: batch normalization, conv layer and any non linear activation function has support for the masking() function in Keras?
Your help is much appreciated!!
So I found the detailed solution in danijar blog https://danijar.com/variable-sequence-lengths-in-tensorflow/.
The masking in keras is used when having incomplete sequences. So usually, you need to pad your sequences with 0.0 in the third dimension (The feature's dimension; when the input dimension has shape = [batch_size, sequence_length, num_features]).Afterwards, the masking in keras will take a number, will output 0 for their activations.
In summary: He showed how to compute the sequence length for each sample in the batch using length() he implemented. The output vector is then fed into the dynamic_rnn which will output zero vectors for incomplete sequences (for states and outputs), which is somehow similar to what happens in Keras Masking() function. Second, we should use a mask when computing the loss function.
All the details are discussed in this blog post.
But regarding the support thingy for masking in batch_norm, conv and non linear activation function; usually, if the output of the LSTM is zeros; then in case with sigmoid activation function at the output; the derivative of the output with respect to the input of the sigmoid function is output(1 - output). Hence, when the output is 0, this derivative is zero as well. And since back propagation applies the chain rule, then the gradients of the current sample with respect to any weight parameter in the network is going to be 0 as well. Hence, there is no need to worry about the support thingy... But the problem arises when the activation is relu for example, this is when the gradients should be explicitely multiplied by zeros before doing the back propagation (I guess). Maybe doing something like this will help:
final_output = output * mask
Then derivative of the final_output with respect to output will be the mask => 0 or 1 (the any time step; for any sample). Then, back propagate this gradient from the output of the activation function to its inputs...followed by chain rule => weights wont be affected in this case.
I'm using google's seq2seq library and based on some computations that I do between the predictions and the input I would like to zero out some of the losses for certain time steps (not for padding).
What I do is basically to go through each batch then through each time step of the decoder (logits) and for eatc time step I add "a zero or a one" to a list (based on my computation). This list should then be converted to a tensor and multiplied by the losses.
My problem is the shape of the tensor reurned by the sparse_softmax_cross_entropy_with_logits is variable, its not always the shape of the target tensor. So there is a mismatch of dimensions. Has anyone does something like this before and can share it, or know why this happens.
I've been going through the docs recently and in many different functions like tf.layers.dense or the tf.nn.conv2d, I came across with the arguments units and filters respectively and I can't understand the point of them. Can someone clearly describe the meaning of
dimensionality of the output space
in the above cases or maybe more general terms? Thanks in advance.
from my opinion:
units in tf.layers.dense:
means that how many output nodes of dense layer should be returned.
Because the fully connected layer(dense layer) should consist of input and output.
Then , the mean of dimensionality of the output space could be translated to the number of ouput nodes.
if the units = 1 , it means all the input nodes connected to one output nodes
in inception v3 or other classifier model, we could found the units of dense layer always be the classifier number.
filters in tf.nn.conv2d:
like the state in api doc :
filter: A Tensor. Must have the same type as input. A 4-D tensor of shape [filter_height, filter_width, in_channels, out_channels]
maybe the confused point is out_channels
for out_channels , I try to understand it as how many filters we want to scan the input tensors.
so out_channels is regarded as the number of kernel.