What is the difference between tf.nn.dynamic_rnn and tf.nn.raw_rnn in TensorFlow?

I went through this tutorial. In the last block it says that the dynamic_rnn function cannot be used to compute attention. What I don't understand is why: all we need to compute the attention is the hidden state of the decoder, which is then worked out against the encoder symbols.

In an encoder-decoder model, an attention mechanism means that at each time step the decoder "attends" to the "useful" parts of the encoder's output. This is implemented, for example, by averaging the encoder's outputs (typically with weights) and feeding that value, called the context, into the decoder at the given time step.
dynamic_rnn runs the LSTM cell across all time steps in a single call and returns the outputs together with the final state. So there is no way to tell the model that the cell state at time step t should depend not only on the output of the previous cell and the input, but also on additional information such as the context. With raw_rnn you can control the computation at each time step of the encoder or decoder LSTM.
If I understand correctly, in this tutorial the author feeds the ground truth as input to the decoder at each time step. However, that is only possible while the ground truth is available, i.e. during training. At inference time you want to feed the output of the decoder at time t as the input to the decoder at time t+1. In short, the input to the decoder at each time step is variable, whereas in dynamic_rnn it is predefined.
For more technical details, refer to: https://arxiv.org/abs/1409.0473
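To make the point about per-step control concrete, here is a minimal sketch in plain Python (no TensorFlow). It is not raw_rnn itself. The names mean_context, decode and toy_step are hypothetical, and the "cell" is a toy function. It only shows the loop structure that dynamic_rnn hides: each step receives the previous step's output plus a context computed from the encoder outputs.

```python
def mean_context(encoder_outputs):
    """Plain average of the encoder output vectors (the simplest 'context')."""
    n = len(encoder_outputs)
    dim = len(encoder_outputs[0])
    return [sum(vec[i] for vec in encoder_outputs) / n for i in range(dim)]


def decode(step_fn, encoder_outputs, start_token, initial_state, max_steps):
    """Run a decoder one step at a time.

    The input of step t+1 is the output of step t, and every step also
    receives a context vector. A real attention model would recompute the
    context from the current decoder state; here it is a fixed average.
    """
    tokens = []
    token, state = start_token, initial_state
    for _ in range(max_steps):
        context = mean_context(encoder_outputs)
        token, state = step_fn(token, state, context)
        tokens.append(token)
    return tokens


def toy_step(prev_token, state, context):
    """Hypothetical stand-in for an LSTM cell plus output projection."""
    new_state = state + sum(context)
    return prev_token + 1, new_state


encoder_outputs = [[1.0, 2.0], [3.0, 4.0]]
print(decode(toy_step, encoder_outputs, start_token=0, initial_state=0.0, max_steps=3))
```

Because the loop body is ordinary code, anything can be injected at each step, which is the kind of control the answer attributes to raw_rnn.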


What's the attention model used in tfjs-examples/date-conversion-attention?

I've been looking at tfjs examples and trying to learn about seq2seq models. During the process, I've stumbled upon the date-conversion-attention example.
It's a great example but what kind of attention mechanism is being used in the example? There is no info in Readme file. Can somebody point me to the paper that describes the attention that's being used here?
Link to attention part:
I believe I found the answer.
The attention model used in the date-conversion-attention example uses the dot-product alignment score, which is described in Effective Approaches to Attention-based Neural Machine Translation. Link: https://arxiv.org/pdf/1508.04025.pdf
I have twisted my head around this sample for some hours now, and this is what I have concluded so far:
The encoder looks at the full input, one character embedding per LSTM step. The decoder expects a time-shifted copy of the output as its input, starting with a special character. The outputs (target strings) are provided as-is to the decoder during training. During evaluation, one character is predicted at a time, and the prediction is passed back into the decoder for the next character.
The decoder does not see the input, but it receives the encoder's final step output as its initial state. This state initialisation tells the decoder how to produce its outputs, something like an encoded description of the date format to work on (I assume).
The LSTM outputs of the encoder and the decoder, one per step (= character of input or output), are then combined with a dot product and normalised with softmax. The result is the attention matrix: basically a highlight of where the activations of the encoder and the decoder match. For the attention heatmap to light up for the next character, the decoder must have output something that "matches" the encoder's outputs. The attention matrix does not consist of learned weights or biases; it's just a product of the encoder's and the decoder's outputs.
Finally, this attention matrix is multiplied (dot product) with the full encoder output, and the result is concatenated with the decoder output, to allow the final dense layers to decode the attention mappings and "read" the right values from the encoder output.
In the prediction process, only the last character is read from the prediction. Possibly because the previous predictions might be unstable?
I read the excellent book Deep Learning with JavaScript: Neural Networks in TensorFlow.js. The book explains the examples one by one and adds lots of extra documentation. But I don't think it explains the general architecture of this sample very well, only the details.
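The steps described above can be sketched in plain Python. This is not the tfjs example's code. The function names and the list-of-lists representation are assumptions made for illustration, and the final dense layers are left out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dot_attention(decoder_outputs, encoder_outputs):
    """Dot-product attention over per-step encoder/decoder output vectors.

    Returns (attention, combined):
      attention[t][s] is the weight of encoder step s for decoder step t,
      combined[t] is the context vector concatenated with decoder_outputs[t].
    """
    dim = len(encoder_outputs[0])
    attention, combined = [], []
    for dec in decoder_outputs:
        # Alignment score: dot product of this decoder step with every encoder step.
        weights = softmax([dot(dec, enc) for enc in encoder_outputs])
        # Context: attention-weighted sum of the encoder outputs.
        context = [
            sum(w * enc[i] for w, enc in zip(weights, encoder_outputs))
            for i in range(dim)
        ]
        attention.append(weights)
        combined.append(context + list(dec))
    return attention, combined


encoder_outputs = [[1.0, 0.0], [0.0, 1.0]]
decoder_outputs = [[10.0, 0.0]]  # "matches" the first encoder step
attention, combined = dot_attention(decoder_outputs, encoder_outputs)
print(attention[0])
```

Note that nothing in dot_attention is trainable, which agrees with the observation that the attention matrix is not made of learned weights.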

Seq2Seq Models for Chatbots

I am building a chatbot with a sequence-to-sequence encoder-decoder model, as in NMT. From the data given, I understand that during training the decoder outputs are fed into the decoder inputs, along with the encoder cell states. What I cannot figure out is what to feed into the decoder when the chatbot is actually deployed in real time, since at that point the output is exactly what I have to predict. Can someone help me out with this, please?
The exact answer depends on which building blocks you take from Neural Machine Translation model (NMT) and which ones you would replace with your own. I assume the graph structure exactly as in NMT.
If so, at inference time, you can feed just a vector of zeros to the decoder.
Internal details: NMT uses the entity called Helper to determine the next input in the decoder (see tf.contrib.seq2seq.Helper documentation).
In particular, tf.contrib.seq2seq.BasicDecoder relies solely on the helper when it performs a step: the next_inputs that are fed into the subsequent cell are exactly the return value of Helper.next_inputs().
There are different implementations of the Helper interface, e.g.:
tf.contrib.seq2seq.TrainingHelper returns the next decoder input (which is usually the ground truth). This helper is used in training, as indicated in the tutorial.
tf.contrib.seq2seq.GreedyEmbeddingHelper discards the inputs and returns the argmax token of the previous output. NMT uses this helper in inference when the sampling_temperature hyper-parameter is 0.
tf.contrib.seq2seq.SampleEmbeddingHelper does the same, but samples the token according to a categorical (a.k.a. generalized Bernoulli) distribution. NMT uses this helper in inference when sampling_temperature > 0.
The code is in BaseModel._build_decoder method.
Note that both GreedyEmbeddingHelper and SampleEmbeddingHelper don't care what the decoder input is. So in fact you can feed anything, but the zero tensor is the standard choice.
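The difference between the two inference helpers comes down to how the next token id is picked from the previous step's logits. Here is a rough plain-Python sketch of that choice. The function names are hypothetical, it works on token ids only (the real helpers also look up the embedding), and the temperature argument mirrors NMT's sampling_temperature as an assumption.

```python
import math
import random


def greedy_next_id(logits):
    """Argmax over the logits, in the spirit of GreedyEmbeddingHelper."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_next_id(logits, temperature, rng):
    """Draw a token id from the categorical distribution softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    # Unnormalised probabilities; random.choices normalises the weights itself.
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0]
print(greedy_next_id(logits))
print(sample_next_id(logits, temperature=1.0, rng=random.Random(0)))
```

In both cases the next decoder input is derived from the model's own previous output, which is why the tensor you feed as decoder input at inference time is ignored.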

how does masking work in a recurrent model in keras?

I found a nicely trained LSTM-based network.
The network allows for masking.
for l in range(len(model.layers)):
is True for me for all the 'name' beside the input layers.
I also have a time series with missing timestamps, which I replace with the correct mask_value.
Does the network treat the masked values like ordinary values when determining the final prediction, so that all computations of the forward pass are actually executed (for example, the state update of an LSTM for each timestamp in the input)? Or are the masked samples skipped completely, so that the computation never takes place?
Keras will skip the masked time steps, as stated in the documentation.
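As a conceptual illustration of what "skipping" means for the recurrent state, here is a toy sketch in plain Python. It is not Keras code. run_with_mask and the step function are hypothetical, and it only models the observable effect: a time step whose features all equal mask_value leaves the state unchanged.

```python
def run_with_mask(inputs, mask_value, step_fn, initial_state):
    """Run a toy recurrent step over `inputs`, carrying the state through masked steps.

    A time step counts as masked when all of its features equal mask_value.
    """
    state = initial_state
    outputs = []
    for features in inputs:
        if all(v == mask_value for v in features):
            # Masked step: no update, the previous state/output is carried over.
            outputs.append(state)
            continue
        state = step_fn(features, state)
        outputs.append(state)
    return outputs, state


def add_step(features, state):
    """Toy 'cell': the state is a running sum of the features."""
    return state + sum(features)


series = [[1.0], [-1.0], [2.0]]  # the middle timestamp is missing
outputs, final_state = run_with_mask(series, mask_value=-1.0, step_fn=add_step, initial_state=0.0)
print(outputs, final_state)
```

Whether the underlying arithmetic for a masked step is physically executed is an implementation detail of the backend; the sketch only shows that masked values do not influence the state or the final prediction.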

Trouble understanding tf.contrib.seq2seq.TrainingHelper

I managed to build a sequence to sequence model in tensorflow using the tf.contrib.seq2seq classes in 1.1 version.
For now I use the TrainingHelper for training my model.
But does this helper feed previously decoded values into the decoder during training, or just the ground truth?
If it doesn't, how can I feed the previously decoded value as input to the decoder instead of the ground truth values?
TrainingHelper feeds the ground truth at every step. If you want to use decoder outputs, you can use scheduled sampling [1]. Scheduled sampling is implemented in ScheduledEmbeddingTrainingHelper and ScheduledOutputTrainingHelper, so you can use one of the two (depending on your particular application) instead of TrainingHelper. See also this thread: scheduled sampling in Tensorflow.
[1] https://arxiv.org/pdf/1506.03099.pdf
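The core idea of scheduled sampling [1] fits in a few lines. The sketch below is plain Python, not the tf.contrib.seq2seq implementation. choose_decoder_input is a hypothetical name and the function works on token ids only.

```python
import random


def choose_decoder_input(ground_truth_id, predicted_id, sampling_probability, rng):
    """Pick the next decoder input for one training step.

    With probability `sampling_probability` the model's own previous prediction
    is used; otherwise the ground truth is fed (plain teacher forcing).
    """
    if rng.random() < sampling_probability:
        return predicted_id
    return ground_truth_id


rng = random.Random(0)
# sampling_probability = 0.0 behaves like TrainingHelper: always the ground truth.
print(choose_decoder_input(7, 3, 0.0, rng))
# sampling_probability = 1.0 always feeds the model's own prediction back in.
print(choose_decoder_input(7, 3, 1.0, rng))
```

In [1] the sampling probability is increased gradually over the course of training, so the model moves from pure teacher forcing towards consuming its own predictions.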

What is a dynamic RNN in TensorFlow?

I am confused about what dynamic RNN (i.e. dynamic_rnn) is. It returns an output and a state in TensorFlow. What are these state and output? What is dynamic in a dynamic RNN, in TensorFlow?
Dynamic RNNs allow for variable sequence lengths. You might have an input shape (batch_size, max_sequence_length), but a dynamic RNN will run for the correct number of time steps on those sequences that are shorter than max_sequence_length.
In contrast, there are static RNNs, which always run for the full, fixed number of time steps. There are cases where you might prefer this, for example if you are padding your inputs to max_sequence_length anyway.
In short, dynamic_rnn is usually what you want for variable length sequential data. It has a sequence_length parameter, and it is your friend.
While AlexDelPiero's answer was what I was googling for, the original question was different. You can take a look at this detailed description about LSTMs and intuition behind them. LSTM is the most common example of an RNN.
The short answer is: the state is an internal detail that is passed from one time step to the next. The output is a tensor containing the outputs of every time step. You usually pass all outputs on to the next RNN layer, or use only the last output if it is the last RNN layer. To get the last output you can use output[:, -1, :].
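To illustrate how the output, the state and sequence_length relate, here is a toy emulation in plain Python. It is not TensorFlow code. run_dynamic, last_relevant and the running-sum "cell" are hypothetical. It follows the behaviour of tf.nn.dynamic_rnn when sequence_length is given: outputs past a sequence's length are zeros and the returned state is the one from the last valid step.

```python
def run_dynamic(batch, lengths, step_fn, initial_state):
    """Emulate the output/state semantics of an RNN with per-example sequence lengths.

    batch:   list of padded sequences, all of the same (max) length
    lengths: true length of each sequence
    """
    max_len = len(batch[0])
    all_outputs, final_states = [], []
    for seq, length in zip(batch, lengths):
        state = initial_state
        outputs = []
        for t in range(max_len):
            if t < length:
                state = step_fn(seq[t], state)
                outputs.append(state)
            else:
                outputs.append(0.0)  # padded step: zero output, state untouched
        all_outputs.append(outputs)
        final_states.append(state)
    return all_outputs, final_states


def last_relevant(all_outputs, lengths):
    """Output at the last valid step of each sequence."""
    return [outs[n - 1] for outs, n in zip(all_outputs, lengths)]


def running_sum(x, state):
    """Toy 'cell' whose output equals its new state."""
    return state + x


batch = [[1.0, 2.0, 3.0], [4.0, 0.0, 0.0]]  # second sequence is padded
lengths = [3, 1]
outputs, states = run_dynamic(batch, lengths, running_sum, 0.0)
print(outputs)
print(states)
print(last_relevant(outputs, lengths))
```

One consequence: when sequence_length is used, taking the output at index -1 gives zeros for sequences shorter than the maximum, so the returned state (or an index based on each sequence's length) is the safer way to get the last meaningful output.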