Trouble understanding tf.contrib.seq2seq.TrainingHelper - tensorflow

I managed to build a sequence-to-sequence model in TensorFlow using the tf.contrib.seq2seq classes in version 1.1.
For now I use the TrainingHelper to train my model.
But does this helper feed previously decoded values into the decoder during training, or just the ground truth?
If it only feeds the ground truth, how can I feed the previously decoded value as the decoder input instead?

TrainingHelper feeds the ground truth at every step. If you want to use decoder outputs, you can use scheduled sampling [1]. Scheduled sampling is implemented in ScheduledEmbeddingTrainingHelper and ScheduledOutputTrainingHelper, so you can use one of the two (depending on your particular application) instead of TrainingHelper. See also the thread "scheduled sampling in Tensorflow".
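To make the difference concrete, here is a minimal plain-Python sketch of the idea behind scheduled sampling. It is not the TensorFlow implementation: the function name choose_decoder_inputs and its arguments are hypothetical, and the model's predictions are assumed to be given up front, whereas in a real decoder each prediction depends on the inputs chosen at earlier steps.

```python
import random


def choose_decoder_inputs(ground_truth, predicted, sampling_probability, rng):
    """Pick each decoder input from the model's own prediction with
    probability `sampling_probability`, otherwise from the ground truth.

    Hypothetical helper for illustration only; not a TensorFlow API.
    """
    chosen = []
    for truth_token, predicted_token in zip(ground_truth, predicted):
        # rng.random() is in [0, 1), so 0.0 never samples and 1.0 always does.
        use_prediction = rng.random() < sampling_probability
        chosen.append(predicted_token if use_prediction else truth_token)
    return chosen


rng = random.Random(0)
truth = ["the", "cat", "sat"]
preds = ["a", "dog", "sat"]

# Probability 0.0 behaves like plain teacher forcing (ground truth only).
print(choose_decoder_inputs(truth, preds, 0.0, rng))
# Probability 1.0 always feeds back the model's own predictions.
print(choose_decoder_inputs(truth, preds, 1.0, rng))
```

Values between 0 and 1 mix the two sources, which is what lets the model gradually get used to its own mistakes during training.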


Seq2Seq Models for Chatbots

I am building a chatbot with a sequence-to-sequence encoder-decoder model, as in NMT. From the data given, I understand that during training the decoder outputs are fed into the decoder inputs along with the encoder cell states. What I cannot figure out is what I should feed into the decoder when the chatbot is actually deployed in real time, since at that point the decoder input is the very output I have to predict. Can someone help me out with this, please?
The exact answer depends on which building blocks you take from Neural Machine Translation model (NMT) and which ones you would replace with your own. I assume the graph structure exactly as in NMT.
If so, at inference time, you can feed just a vector of zeros to the decoder.
Internal details: NMT uses an entity called a Helper to determine the next input to the decoder (see the tf.contrib.seq2seq.Helper documentation).
In particular, tf.contrib.seq2seq.BasicDecoder relies solely on the helper when it performs a step: the next_inputs that are fed into the subsequent cell are exactly the return value of Helper.next_inputs().
There are different implementations of the Helper interface, for example:
tf.contrib.seq2seq.TrainingHelper returns the next decoder input from the sequence it was given (usually the ground truth). This helper is used in training, as indicated in the tutorial.
tf.contrib.seq2seq.GreedyEmbeddingHelper discards the inputs and returns the embedding of the argmax token from the previous output. NMT uses this helper in inference when the sampling_temperature hyper-parameter is 0.
tf.contrib.seq2seq.SampleEmbeddingHelper does the same, but samples the token from a categorical (a.k.a. generalized Bernoulli) distribution. NMT uses this helper in inference when sampling_temperature > 0.
The code is in the BaseModel._build_decoder method.
Note that neither GreedyEmbeddingHelper nor SampleEmbeddingHelper cares what the decoder input is. So in fact you can feed anything, but the zero tensor is the standard choice.
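The three strategies can be sketched in plain Python as follows. This is only an illustration of how each one picks the next token, assuming the previous step produced a list of logits; the function names are hypothetical and are not the tf.contrib.seq2seq API. The real embedding helpers additionally look up the embedding of the chosen token id, which is omitted here.

```python
import math
import random


def training_next_token(ground_truth, time):
    # Training-style choice: ignore the decoder output and read the
    # next token from the provided (ground-truth) sequence.
    return ground_truth[time]


def greedy_next_token(previous_logits):
    # Greedy choice: the index with the highest logit (argmax).
    return max(range(len(previous_logits)), key=previous_logits.__getitem__)


def sample_next_token(previous_logits, rng):
    # Sampling choice: draw an index from the categorical distribution
    # defined by the softmax of the logits.
    highest = max(previous_logits)
    weights = [math.exp(logit - highest) for logit in previous_logits]
    return rng.choices(range(len(previous_logits)), weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0]
print(training_next_token([5, 6, 7], time=1))       # reads from the target sequence
print(greedy_next_token(logits))                    # index of the largest logit
print(sample_next_token(logits, random.Random(0)))  # random index, biased toward large logits
```

Only the first function looks at the sequence passed in as decoder input, which is why the greedy and sampling variants do not care what you feed at inference time.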

how does masking work in a recurrent model in keras?

I found a nicely trained LSTM-based network.
The network allows for masking.
for l in range(len(model.layers)):
is True for me for every 'name' except the input layers.
I also have a time series with missing timestamps, which I replace with the correct mask_value.
Does the network treat the masked values like any other values when determining the final prediction, so that all the computations of the forward pass are actually executed (for example, updating the LSTM state for each input timestamp), or are the masked samples skipped completely so that the computation never takes place?
Keras will skip the masked time steps, as stated in the documentation.
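As a rough illustration of what "skipping" means for the result, here is a plain-Python sketch of a recurrent loop that carries the state forward unchanged whenever every feature of a time step equals the mask value. The names run_masked_rnn and step_fn are hypothetical, and the toy step function stands in for an LSTM cell. The sketch only shows the observable effect on the state; it says nothing about whether a particular backend also saves the arithmetic.

```python
def run_masked_rnn(inputs, mask_value, step_fn, initial_state):
    """Run step_fn over the time steps, leaving the state untouched on
    steps where all features equal mask_value (hypothetical sketch)."""
    state = initial_state
    states = []
    for features in inputs:
        is_masked = all(value == mask_value for value in features)
        if not is_masked:
            state = step_fn(state, features)
        # On a masked step the previous state is simply repeated.
        states.append(state)
    return states


def toy_step(state, features):
    # Toy recurrent update: decay the old state and add the new input.
    return 0.5 * state + sum(features)


series = [[1.0], [0.0], [2.0]]  # the middle time step is "missing"
print(run_masked_rnn(series, 0.0, toy_step, 0.0))    # masked step leaves the state alone
print(run_masked_rnn(series, -99.0, toy_step, 0.0))  # nothing masked: every step updates
```

In the first call the state after the second step is the same as after the first, so the missing timestamp has no influence on the final prediction.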

Tensorflow input pipeline

I have an input pipeline where samples are generated on the fly. I use Keras with a custom ImageDataGenerator and a corresponding Iterator to get samples into memory.
Assuming that Keras in my setup uses feed_dict (and that assumption is itself a question for me), I am thinking of speeding things up by switching to raw TensorFlow + Dataset.from_generator().
From what I have read, the suggested solution in the most recent TensorFlow for input pipelines that generate data on the fly is Dataset.from_generator().
Does Keras with the TensorFlow backend use the feed_dict method?
If I switch to raw TensorFlow + Dataset.from_generator(my_sample_generator), will that cut the feed_dict memory-copy overhead and buy me performance?
During the predict (evaluation) phase, apart from batch_x and batch_y, my generator also outputs an opaque index vector. That vector corresponds to the sample ids in batch_x. Does that mean I'm stuck with the feed_dict approach for the predict phase, because I need that extra batch_z output from the iterator?
The new Dataset.from_generator() can potentially speed up your input pipeline by overlapping the data preparation with training. However, you will tend to get the best performance by switching over to TensorFlow ops in your input pipeline wherever possible.
To answer your specific questions:
The Keras TensorFlow backend uses tf.placeholder() to represent compiled function inputs, and feed_dict to pass arguments to a function.
With the recent optimizations to tf.py_func() and feed_dict copy overhead, I suspect the amount of time spent in memcpy() will be the same. However, you can more easily use Dataset.from_generator() with Dataset.prefetch() to overlap the training on one batch with preprocessing on the next batch.
It sounds like you can define a separate iterator for the prediction phase. The tf.estimator.Estimator class does something similar by instantiating different "input functions" with different signatures for training and evaluation, then building a separate graph for each role.
Alternatively, you could add a dummy output to your training iterator (for the batch_z values) and switch between training and evaluation iterators using a "feedable iterator".
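To show what "overlapping" buys you, here is a plain-Python sketch of the prefetching idea using a background thread and a bounded queue. It is a conceptual stand-in, not how Dataset.prefetch() is implemented; the names prefetch and sample_generator are hypothetical, and the generator yields a made-up (batch_x, batch_y, batch_z) triple to mirror the extra index output from the question.

```python
import queue
import threading


def prefetch(generator, buffer_size=1):
    """Yield items from `generator` while a background thread prepares
    the next ones (hypothetical sketch of the prefetching idea)."""
    buffer = queue.Queue(maxsize=buffer_size)
    end_marker = object()

    def producer():
        for item in generator:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(end_marker)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is end_marker:
            return
        yield item


def sample_generator():
    # Made-up data: features, label, and an opaque sample id per batch.
    for sample_id in range(3):
        batch_x = [sample_id, sample_id + 1]
        batch_y = sample_id % 2
        batch_z = sample_id
        yield batch_x, batch_y, batch_z


batches = list(prefetch(sample_generator()))
for batch_x, batch_y, batch_z in batches:
    # While this loop body (the "training step") runs, the producer
    # thread is already preparing the next batch.
    print(batch_x, batch_y, batch_z)
```

The extra batch_z element travels with each batch, so the consumer can simply ignore it during training and use it during prediction.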

what is the difference between tf.nn.dynamic_rnn and tf.nn.raw_rnn in tensorflow?

I went through a tutorial. Its last block says that the dynamic_rnn function cannot be used to calculate attention. What I don't understand is why: all we need is the hidden state of the decoder in order to compute the attention, which is then worked out against the encoder symbols.
Attention mechanism in the context of encoder-decoder means that decoder at each time step "attends" to the "useful" parts of the encoder. This is implemented as, for example, averaging encoder's outputs and feeding that value (called context) into a decoder at a given time step.
dynamic_rnn computes outputs of LSTM cells across all time steps and gives you the final value. So, there is no way to tell the model that the cell state at time step t should depend not only on the output of the previous cell and input, but also on additional information such as context. You can control computation at each time step of encoder or decoder LSTM using raw_rnn.
If I understand correctly, in this tutorial the author feeds ground truth input as input to the decoder at each time step. However, this is not the usual way it is done. Usually, you want to feed the output of decoder at time t as input to decoder at time t+1. In short, the input to the decoder at each time step is variable, whereas in dynamic_rnn it is predefined.
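The per-step dependence can be sketched in plain Python. The loop below recomputes a context vector from the current decoder state at every step and mixes it into the next state, which is exactly what a fixed, predefined input sequence cannot express. Everything here is an assumption for illustration: dot-product scoring, the toy state update, and the function names are not taken from the tutorial or from TensorFlow.

```python
import math


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attention_context(decoder_state, encoder_outputs):
    """Weighted average of the encoder outputs, with weights given by the
    softmax of dot products with the decoder state."""
    scores = [sum(d * e for d, e in zip(decoder_state, output))
              for output in encoder_outputs]
    weights = softmax(scores)
    size = len(encoder_outputs[0])
    context = [sum(w * output[i] for w, output in zip(weights, encoder_outputs))
               for i in range(size)]
    return context, weights


def decode(encoder_outputs, initial_state, num_steps):
    state = initial_state
    contexts = []
    for _step in range(num_steps):
        # The context depends on the current state, so it cannot be
        # computed ahead of time for all steps at once.
        context, _weights = attention_context(state, encoder_outputs)
        # Toy cell update: blend the old state with the context.
        state = [0.5 * s + 0.5 * c for s, c in zip(state, context)]
        contexts.append(context)
    return contexts


encoder_outputs = [[1.0, 0.0], [0.0, 1.0]]
for context in decode(encoder_outputs, [1.0, 0.0], num_steps=3):
    print(context)
```

With raw_rnn you write this kind of loop body yourself, so the input to each step can be built from the previous state or output.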

How to get both loss and model output at once, on a batch of data in Keras?

I'm using Keras w/ Tensorflow backend to train a NN.
I'm using train_on_batch for training, which returns the loss on the given batch. How do I also get the output classification on that batch? (I'd like to do some visualisations of the output.)
To do that I currently make another call to predict to get the model output, but that's redundant since train_on_batch has already passed the input batch forward.
In Caffe, when an image is fed forward, the intermediate layer outputs stay stored in net.blobs, but in Keras/TensorFlow it seems that to get an intermediate output we have to rerun the computational graph for each intermediate output we want to access on the CPU. Is there a way to access many or all intermediate layers' outputs without rerunning the graph for each?
I don't mind having a tensorflow-specific workaround.
If you use the functional API, this is pretty straightforward.
In addition to @MohamedEzz's answer, you can create a custom callback that performs the operations you require during training. Callbacks have methods that run your code at specific points, such as on_epoch_begin, on_epoch_end, on_train_end and so on.
This way you can preserve the batch.