The tf.contrib.seq2seq.BasicDecoder documentation says: output_layer: (Optional) An instance of tf.layers.Layer, i.e., tf.layers.Dense. Optional layer to apply to the RNN output prior to storing the result or sampling.
tf.contrib.rnn.OutputProjectionWrapper is described as an operator adding an output projection to the given cell.
I assume these are the same?
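For comparison, here is how I understand each one would be wired up (just a sketch with TF 1.x contrib APIs; the shapes, TrainingHelper and dynamic_decode plumbing below are placeholders I made up, not taken from either doc):

import tensorflow as tf

batch_size, max_time, num_units, vocab_size = 32, 10, 256, 1000

# Embedded decoder inputs and their lengths (placeholders for illustration).
decoder_inputs = tf.placeholder(tf.float32, [batch_size, max_time, num_units])
decoder_lengths = tf.placeholder(tf.int32, [batch_size])

cell = tf.contrib.rnn.LSTMCell(num_units)

# Option A: wrap the cell, so every RNN output is projected to vocab_size.
projected_cell = tf.contrib.rnn.OutputProjectionWrapper(cell, vocab_size)

# Option B: hand a Dense layer to BasicDecoder; it is applied to the RNN
# output before the result is stored or sampled.
helper = tf.contrib.seq2seq.TrainingHelper(decoder_inputs, decoder_lengths)
decoder = tf.contrib.seq2seq.BasicDecoder(
    cell=cell,
    helper=helper,
    initial_state=cell.zero_state(batch_size, tf.float32),
    output_layer=tf.layers.Dense(vocab_size))
outputs, final_state, final_lengths = tf.contrib.seq2seq.dynamic_decode(decoder)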
From the documentation on tf.keras.layers.Embedding:
input_dim:
Integer. Size of the vocabulary, i.e. maximum integer index + 1.
mask_zero:
Boolean, whether or not the input value 0 is a special “padding” value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).
I was reading this answer but I'm still confused. If my vocabulary size is n but the tokens are encoded with index values from 1 to n (0 is left for padding), is input_dim equal to n or n+1?
If the inputs are padded with zeroes, what are the consequences of leaving mask_zero = False?
If mask_zero = True, then based on the documentation I would have to increment the answer to my first question by one? What is the expected behaviour if this were not done?
I am basically just trying to rephrase parts of the linked answer to make it a bit more understandable in the current context, and also address your other subquestions (which technically should be their own questions, according to [ask]).
It does not matter whether you actually use 0 for padding or not: Keras assumes that indexing starts at zero and has to "brace itself" for an input value of 0 in your data. Therefore, you need to set input_dim to n+1, because you are essentially adding one specific value to your vocabulary that you previously didn't consider.
I think discussing this in detail is out of scope for this question, but, depending on the exact model, masking means that the loss values on padded positions do not affect the backpropagation. However, if you choose mask_zero = False, your model will essentially have to correctly predict padding on all of those positions (and the padding then also affects the training).
This relates to my illustration: Essentially, you are adding a new vocabulary index. If you do not adjust your dimension, there will likely be an indexing error (out of range) for the vocabulary entry with the highest index (n). Otherwise, you would likely not notice any different behavior.
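To make that concrete, here is a minimal sketch (the vocabulary size n, the token ids and the layer sizes are made-up values; tokens are assumed to be encoded as 1..n, with 0 reserved for padding):

import numpy as np
from tensorflow import keras

n = 5000                            # vocabulary size; tokens are encoded as 1..n
x = np.array([[7, 42, 3, 0, 0]])    # one padded sequence (0 = padding)

model = keras.Sequential([
    # input_dim must cover the indices 0..n, i.e. n + 1 distinct values,
    # because index 0 is reserved for the padding entry.
    keras.layers.Embedding(input_dim=n + 1, output_dim=64, mask_zero=True),
    # The LSTM receives the mask and skips the padded timesteps.
    keras.layers.LSTM(32),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
print(model.predict(x).shape)       # (1, 1)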
How about this thought experiment. You have an LSTM network to encode a sequence of integers (the inputs are x_t). For each cell, the weights, biases and outputs of the gates, as well as the previous states (C_t-1, h_t-1), are calculated and stored for the forward pass. Each cell thus has a certain hard-coded pattern. We'll use this image as a visualization aid (from Wikipedia).
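For reference (in case the image does not load), the cell in that diagram computes, with \sigma the logistic sigmoid and \odot elementwise multiplication:

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
C_t = f_t \odot C_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)
h_t = o_t \odot \tanh(C_t)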
I now want to take a cell with all its calculated values and switch the input with the output. The new input is a new value that gets entered into h_t; let's call it h'_t. The computation is inverted to arrive at a new value for x_t that we'll call x'_t. Does the cell keep its properties, so that there is a (non-linear) relationship between the output of a cell and the input, regardless of which output value I enter into it? This calculation would then be performed for the whole sequence of LSTM cells, but each cell would get a different h'_t value as an input.
Furthermore, would it be enough to only calculate it backwards for the output gate, since one can arrive at x'_t either through the output gate, the input gate, or the forget gate?
In tensorflow.contrib.seq2seq's AttentionWrapper, what does "depth" refer to as stated in the attention_layer_size documentation? When the documentation says to "use the context as attention" if the value is None, what is meant by "the context"?
In Neural Machine Translation by Jointly Learning to Align and Translate they give a description of the (Bahdanau) attention mechanism; essentially what happens is that you compute scalar "alignment scores" a_1, a_2, ..., a_n that indicate how important each element of your encoded input sequence is at a given moment in time (i.e. which part of the input sentence you should pay attention to right now in the current timestep).
Assuming your (encoded) input sequence that you want to "pay attention"/"attend over" is a sequence of vectors denoted as e_1, e_2, ..., e_n, the context vector at a given timestep is the weighted sum over all of these as determined by your alignment scores:
context = c := (a_1*e_1) + (a_2*e_2) + ... + (a_n*e_n)
(Remember that the a_k's are scalars; you can think of this as an "averaged-out" letter/word in your sentence --- so ideally if your model is trained well, the context looks most similar to the e_i you want to pay attention to the most, but bears a little bit of resemblance to e_{i-1}, e_{i+1}, etc. Intuitively, think of a "smeared-out" input element, if that makes any sense...)
Anyway, if attention_layer_size is not None, then it specifies the number of hidden units in a feedforward layer within your decoder that is used to mix this context vector with the output of the decoder's internal RNN cell to get the attention value. If attention_layer_size == None, it just uses the context vector above as the attention value, and no mixing of the internal RNN cell's output is done. (When I say "mixing", I mean that the context vector and the RNN cell's output are concatenated and then projected to the dimensionality that you specify by setting attention_layer_size.)
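As a toy numerical sketch of the two cases (all shapes and values are made up; if I remember correctly, the mixing layer inside AttentionWrapper is a Dense layer without a bias):

import numpy as np

n, d, cell_size, attn_size = 6, 8, 8, 5       # made-up sizes
e = np.random.randn(n, d)                     # encoded inputs e_1 ... e_n
cell_output = np.random.randn(cell_size)      # decoder cell output at this timestep

scores = np.random.randn(n)                   # unnormalized alignment scores
a = np.exp(scores) / np.exp(scores).sum()     # softmax -> a_1 ... a_n sum to 1

# Context vector: weighted sum of the encoded inputs, shape (d,).
context = (a[:, None] * e).sum(axis=0)

# attention_layer_size is None -> the attention value is just the context:
attention = context

# attention_layer_size == attn_size -> concatenate and project ("mixing"):
W = np.random.randn(attn_size, cell_size + d)
attention = W @ np.concatenate([cell_output, context])   # shape (attn_size,)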
The relevant part of the implementation is at this line and has a description of how it's computed.
Hope that helps!
I'm using tf.nn.bidirectional_rnn with the sequence_length parameter for variable input size, and I can't figure out how to get the final output for each sample in the minibatch:
output, _, _ = tf.nn.bidirectional_rnn(forward1, backward1, input, dtype=tf.float32, sequence_length=input_lengths)
Now, if I had constant sequence lengths, I would simply use output[-1] and get the final output. In my case I have variable sequences (their lengths are known).
Also, is this output the output of both forward and backward LSTMs?
Thanks.
This question can be answered by looking at the source code of rnn.py.
For sequences with dynamic length, the source code says:
If the sequence_length vector is provided, dynamic calculation is
performed. This method of calculation does not compute the RNN steps
past the maximum sequence length of the minibatch (thus saving
computational time), and properly propagates the state at an
example's sequence length to the final state output.
Therefore, in order to get the actual last output, you should slice the resulting output.
For bidirectional_rnn, the source code says:
A tuple (outputs, output_state_fw, output_state_bw) where:
outputs is a length T list of outputs (one for each input), which
are depth-concatenated forward and backward outputs.
output_state_fw is the final state of the forward rnn.
output_state_bw is the final state of the backward rnn.
Therefore, the return value is a tuple rather than a single tensor, and outputs itself is a length-T list with one (depth-concatenated forward/backward) tensor per timestep.
You can pack that list into a single tensor and slice it by sequence length if you wish.
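For example, continuing from the snippet in the question (just a sketch; output and input_lengths are the question's tensors, and tf.pack is called tf.stack in newer TensorFlow versions):

import tensorflow as tf

packed = tf.pack(output)                     # [max_time, batch_size, 2 * num_units]
packed = tf.transpose(packed, [1, 0, 2])     # [batch_size, max_time, 2 * num_units]

out_size = int(packed.get_shape()[2])        # assumes a statically known output size
batch_size = tf.shape(packed)[0]
max_time = tf.shape(packed)[1]

# Index of the last valid timestep for every example in the batch.
index = tf.range(batch_size) * max_time + (input_lengths - 1)
last_outputs = tf.gather(tf.reshape(packed, [-1, out_size]), index)   # [batch_size, 2 * num_units]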
I am a newbie to torch and lua (as anyone who has been following my latest posts could attest :) and have the following question on the forward function for the gmodule object (class nngraph).
As per the source code (https://github.com/torch/nn/blob/master/Module.lua - the gmodule class inherits from nn.Module), the syntax is:
function Module:forward(input)
return self:updateOutput(input)
end
However, I have found cases where a table is passed as input, as in:
local lst = clones.rnn[t]:forward{x[{{}, t}], unpack(rnn_state[t-1])}
where:
clones.rnn[t]
is itself a gmodule object. In turn, rnn_state[t-1] is a table with 4 tensors. So in the end, we have something akin to
result_var = gmodule:forward{[1]=tensor_1,[2]=tensor_2,[3]=tensor_3,...,[5]=tensor_5}
The question is, depending on the network architecture, can you pass input - formatted as table - not only to the input layer but also to the hidden layers?
In that case, do you have to check that you pass exactly one input per layer (with the exception of the output layer)?
Thanks so much
I finally found the answer. The module class (as well as the inherited class gmodule) has an input and an output.
However, the input (as well as the output) need not be a single vector; it can be a collection of vectors. That depends on the neural net configuration; in this particular case it is a pretty complex recursive neural net.
So if the net has more than one input vector, you can do:
result_var = gmodule:forward{[1]=tensor_1,[2]=tensor_2,[3]=tensor_3,...,[5]=tensor_5}
where each tensor/vector is one of the input vectors. Only one of those vectors is the X vector, or the feature vector. The others could serve as input to other intermediate nodes.
In turn, result_var (which is the output) can be a single tensor (the prediction) or a collection of tensors, depending on the network configuration.
If the latter is the case, one of those output tensors is the prediction, and the remainder are usually used as input to the intermediate nodes in the next time step, but that again depends on the net configuration.