Implementing attention with beam search in tensorflow - tensorflow

I have written my own code with reference to this wonderful tutorial and I am not able to get results when using attention with beam search as per my understanding in the class AttentionModel the _build_decoder_cell function creates separate decoder cell and attention wrapper for inference mode , assuming this ( which i think is incorrect and cant find a way around it ) ,
with tf.name_scope("Decoder"):
mem_units = 2*dim
dec_cell = tf.contrib.rnn.BasicLSTMCell( 2*dim )
beam_cel = tf.contrib.rnn.BasicLSTMCell( 2*dim )
beam_width = 3
out_layer = Dense( output_vocab_size )
with tf.name_scope("Training"):
attn_mech = tf.contrib.seq2seq.BahdanauAttention( num_units = mem_units, memory = enc_rnn_out, normalize=True)
attn_cell = tf.contrib.seq2seq.AttentionWrapper( cell = dec_cell,attention_mechanism = attn_mech )
batch_size = tf.shape(enc_rnn_out)[0]
initial_state = attn_cell.zero_state( batch_size = batch_size , dtype=tf.float32 )
initial_state = initial_state.clone(cell_state = enc_rnn_state)
helper = tf.contrib.seq2seq.TrainingHelper( inputs = emb_x_y , sequence_length = seq_len )
decoder = tf.contrib.seq2seq.BasicDecoder( cell = attn_cell, helper = helper, initial_state = initial_state ,output_layer=out_layer )
outputs, final_state, final_sequence_lengths= tf.contrib.seq2seq.dynamic_decode(decoder=decoder,impute_finished=True)
training_logits = tf.identity(outputs.rnn_output )
training_pred = tf.identity(outputs.sample_id )
with tf.name_scope("Inference"):
enc_rnn_out_beam = tf.contrib.seq2seq.tile_batch( enc_rnn_out , beam_width )
seq_len_beam = tf.contrib.seq2seq.tile_batch( seq_len , beam_width )
enc_rnn_state_beam = tf.contrib.seq2seq.tile_batch( enc_rnn_state , beam_width )
batch_size_beam = tf.shape(enc_rnn_out_beam)[0] # now batch size is beam_width times
# start tokens mean be the original batch size so divide
start_tokens = tf.tile(tf.constant([27], dtype=tf.int32), [ batch_size_beam//beam_width ] )
end_token = 0
attn_mech_beam = tf.contrib.seq2seq.BahdanauAttention( num_units = mem_units, memory = enc_rnn_out_beam, normalize=True)
cell_beam = tf.contrib.seq2seq.AttentionWrapper(cell=beam_cel,attention_mechanism=attn_mech_beam,attention_layer_size=mem_units)
initial_state_beam = cell_beam.zero_state(batch_size=batch_size_beam,dtype=tf.float32).clone(cell_state=enc_rnn_state_beam)
my_decoder = tf.contrib.seq2seq.BeamSearchDecoder( cell = cell_beam,
embedding = emb_out,
start_tokens = start_tokens,
end_token = end_token,
initial_state = initial_state_beam,
beam_width = beam_width
,output_layer=out_layer)
beam_output, t1 , t2 = tf.contrib.seq2seq.dynamic_decode( my_decoder,
maximum_iterations=maxlen )
beam_logits = tf.no_op()
beam_sample_id = beam_output.predicted_ids
when i call beam _sample_id after training i am not getting correct result.
my guess is that we are supposed to using the same attention wrapper but that is not possible since we have to tile_sequence for beam search to be used.
Any insights / suggestion would be much appreciated.
i have also created an issue for this in their main repository Issue-93

I'm not sure what do you mean by "I am not able to get results" but I'm assuming that your model is not making use of the wieghts learnt while training .
if this is the case , then first of all you need to know that its all about variable sharing , the first thing you need to do is that you get rid of the different variable scopes between the training and inferring and instead you need to use some thing like
remove the
with tf.name_scope("Training"):
and use :
with tf.variable_scope("myScope"):
and then remove the
with tf.name_scope("Inference"):
and use instead
with tf.variable_scope("myScope" , reuse=True):
also at the beginning of your and after with tf.variable_scope("myScope" )
enc_rnn_out = tf.contrib.seq2seq.tile_batch( enc_rnn_out , 1 )
seq_len = tf.contrib.seq2seq.tile_batch( seq_len , 1 )
enc_rnn_state = tf.contrib.seq2seq.tile_batch( enc_rnn_state , 1 )
this will insure that your inference variables and training variables have the same signature and are shared ,
I have tested this when I was following the same tutorial that you have mentioned , my model is still training as I'm writing this post but I can see that the accuracy increasing as we speak , which indicates that the solution should work for you as well .
thank you

You can use tf.cond() to create different pathes between training and inference stage:
def get_tile_batch(enc_output, source_sequence_length, enc_state, useBeamSearch):
enc_output = tf.contrib.seq2seq.tile_batch(enc_output, multiplier=useBeamSearch)
source_sequence_length = tf.contrib.seq2seq.tile_batch(source_sequence_length, multiplier=useBeamSearch)
enc_state = tf.contrib.seq2seq.tile_batch(enc_state, multiplier=useBeamSearch)
return enc_output, source_sequence_length, enc_state
## for beam search: at training stage, use tile_batch multiplier = 1,
## at infer stage, use tile_batch multiplier = useBeamSearch
## tile_batch is just duplicate every sample in a batch,
## so it'll change batch_size to batch_size * useBeamSearch at runtime once batch_size was determined
enc_output, source_sequence_length, enc_state = tf.cond(
self.on_infer, # is inference stage?
lambda: get_tile_batch(enc_output, source_sequence_length, enc_state, useBeamSearch=useBeamSearch),
lambda: get_tile_batch(enc_output, source_sequence_length, enc_state, useBeamSearch=1)
)
# attention mechanism
attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(num_units=rnn_size, memory=enc_output, memory_sequence_length=source_sequence_length)
dec_cell = tf.contrib.seq2seq.AttentionWrapper(dec_cell, attention_mechanism)
## for beam search: change batch_size to batch_size * useBeamSearch at infer stage
decoder_initial_state = tf.cond(
self.on_infer, # is inference stage?
lambda: dec_cell.zero_state(batch_size=batch_size * useBeamSearch, dtype=tf.float32),
lambda: dec_cell.zero_state(batch_size=batch_size * 1, dtype=tf.float32)
)
enc_state = decoder_initial_state.clone(cell_state=enc_state)

Related

How to build a custom question-answering head when using hugginface transformers?

Using the TFBertForQuestionAnswering.from_pretrained() function, we get a predefined head on top of BERT together with a loss function that are suitable for this task.
My question is how to create a custom head without relying on TFAutoModelForQuestionAnswering.from_pretrained().
I want to do this because there is no place where the architecture of the head is explained clearly. By reading the code here we can see the architecture they are using, but I can't be sure I understand their code 100%.
Starting from How to Fine-tune HuggingFace BERT model for Text Classification is good. However, it covers only the classification task, which is much simpler.
'start_positions' and 'end_positions' are created following this tutorial.
So far, I've got the following:
train_dataset
# Dataset({
# features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
# num_rows: 99205
# })
train_dataset.set_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask'])
features = {x: train_dataset[x] for x in ['input_ids', 'token_type_ids', 'attention_mask']}
labels = [train_dataset[x] for x in ['start_positions', 'end_positions']]
labels = np.array(labels).T
tfdataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(16)
input_ids = tf.keras.layers.Input(shape=(256,), dtype=tf.int32, name='input_ids')
token_type_ids = tf.keras.layers.Input(shape=(256,), dtype=tf.int32, name='token_type_ids')
attention_mask = tf.keras.layers.Input((256,), dtype=tf.int32, name='attention_mask')
bert = TFAutoModel.from_pretrained("bert-base-multilingual-cased")
output = bert([input_ids, token_type_ids, attention_mask]).last_hidden_state
output = tf.keras.layers.Dense(2, name="qa_outputs")(output)
model = tf.keras.models.Model(inputs=[input_ids, token_type_ids, attention_mask], outputs=output)
num_train_epochs = 3
num_train_steps = len(tfdataset) * num_train_epochs
optimizer, schedule = create_optimizer(
init_lr=2e-5,
num_warmup_steps=0,
num_train_steps=num_train_steps,
weight_decay_rate=0.01
)
def qa_loss(labels, logits):
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
from_logits=True, reduction=tf.keras.losses.Reduction.NONE
)
start_loss = loss_fn(labels[0], logits[0])
end_loss = loss_fn(labels[1], logits[1])
return (start_loss + end_loss) / 2.0
model.compile(
loss=loss_fn,
optimizer=optimizer
)
model.fit(tfdataset, epochs=num_train_epochs)
And I am getting the following error:
ValueError: `labels.shape` must equal `logits.shape` except for the last dimension. Received: labels.shape=(2,) and logits.shape=(256, 2)
It is complaining about the shape of the labels. This should not happen since I am using SparseCategoricalCrossentropy loss.
For future reference, I actually found a solution, which is just editing the TFBertForQuestionAnswering class itself. For example, I added an additional layer in the following code and trained the model as usual and it worked.
from transformers import TFBertPreTrainedModel
from transformers import TFBertMainLayer
from transformers.modeling_tf_utils import TFQuestionAnsweringLoss, get_initializer, input_processing
from transformers.modeling_tf_outputs import TFQuestionAnsweringModelOutput
from transformers import BertConfig
class MY_TFBertForQuestionAnswering(TFBertPreTrainedModel, TFQuestionAnsweringLoss):
# names with a '.' represents the authorized unexpected/missing layers when a TF model is loaded from a PT model
_keys_to_ignore_on_load_unexpected = [
r"pooler",
r"mlm___cls",
r"nsp___cls",
r"cls.predictions",
r"cls.seq_relationship",
]
def __init__(self, config: BertConfig, *inputs, **kwargs):
super().__init__(config, *inputs, **kwargs)
self.num_labels = config.num_labels
self.bert = TFBertMainLayer(config, add_pooling_layer=False, name="bert")
# This is the dense layer I added
self.my_dense = tf.keras.layers.Dense(
units=config.hidden_size,
kernel_initializer=get_initializer(config.initializer_range),
name="my_dense",
)
self.qa_outputs = tf.keras.layers.Dense(
units=config.num_labels,
kernel_initializer=get_initializer(config.initializer_range),
name="qa_outputs",
)
def call(
self,
input_ids = None,
attention_mask = None,
token_type_ids = None,
position_ids = None,
head_mask = None,
inputs_embeds = None,
output_attentions = None,
output_hidden_states = None,
return_dict = None,
start_positions = None,
end_positions= None,
training = False,
**kwargs,
):
r"""
start_positions (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
are not taken into account for computing the loss.
end_positions (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
are not taken into account for computing the loss.
"""
inputs = input_processing(
func=self.call,
config=self.config,
input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
start_positions=start_positions,
end_positions=end_positions,
training=training,
kwargs_call=kwargs,
)
outputs = self.bert(
input_ids=inputs["input_ids"],
attention_mask=inputs["attention_mask"],
token_type_ids=inputs["token_type_ids"],
position_ids=inputs["position_ids"],
head_mask=inputs["head_mask"],
inputs_embeds=inputs["inputs_embeds"],
output_attentions=inputs["output_attentions"],
output_hidden_states=inputs["output_hidden_states"],
return_dict=inputs["return_dict"],
training=inputs["training"],
)
sequence_output = outputs[0]
# You also have to add it here
my_logits = self.my_dense(inputs=sequence_output)
logits = self.qa_outputs(inputs=my_logits)
start_logits, end_logits = tf.split(value=logits, num_or_size_splits=2, axis=-1)
start_logits = tf.squeeze(input=start_logits, axis=-1)
end_logits = tf.squeeze(input=end_logits, axis=-1)
loss = None
if inputs["start_positions"] is not None and inputs["end_positions"] is not None:
labels = {"start_position": inputs["start_positions"]}
labels["end_position"] = inputs["end_positions"]
loss = self.hf_compute_loss(labels=labels, logits=(start_logits, end_logits))
if not inputs["return_dict"]:
output = (start_logits, end_logits) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return TFQuestionAnsweringModelOutput(
loss=loss,
start_logits=start_logits,
end_logits=end_logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
def serving_output(self, output: TFQuestionAnsweringModelOutput) -> TFQuestionAnsweringModelOutput:
hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
return TFQuestionAnsweringModelOutput(
start_logits=output.start_logits, end_logits=output.end_logits, hidden_states=hs, attentions=attns
)

Expected to see 3 array(s), but instead got the following list of 1 arrays:

I am trying to train a triple loss model using a fit_generator. it requires three input and no output. so i have a function that generates hard triplets. the output from the triplets generator has a shape of (3,5,279) which is 3 inputs(anchor,positive and negative) for 5 batches and a total of 279 features. When i run the fit_generator it throws this error that "the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 3 array(s), but instead got the following list of 1 arrays" meanwhile i have passed a list of three arrays. the code is below. it works when i use the fit, however, i want to always call the generator function to generate my triplets as my batches. thanks in advance..this has taken me three days
def load_data():
path = "arrhythmia_data.txt"
f = open( path, "r")
data = []
#remove line breaker, comma separate and store in array
for line in f:
line = line.replace('\n','').replace('?','0')
line = line.split(",")
data.append(line)
f.close()
data = np.array(data).astype(np.float64)
#print(data.shape)
#create the class labels for input data
Y_train = data[:,-1:]
train = data[:,:-1]
normaliser = preprocessing.MinMaxScaler()
train = normaliser.fit_transform(train)
val = train[320:,:]
train = train[:320,:]
#create one hot encoding of the class labels of the data and separate them into train and test data
lb = LabelBinarizer()
encode = lb.fit_transform(Y_train)
nb_classes = int(len(encode[0]))
#one_hot_labels = keras.utils.to_categorical(labels, num_classes=10) this could also be used for one hot encoding
Y_val_e = encode[320:,:]
Y_train_e = encode[:320,:]
print(Y_train_e[0])
print(np.argmax(Y_train_e[0]))
val_in = []
train_in = []
#grouping and sorting the input data based on label id or name
for n in range(nb_classes):
images_class_n = np.asarray([row for idx,row in enumerate(train) if np.argmax(Y_train_e[idx])==n])
train_in.append(images_class_n)
images_class_n = np.asarray([row for idx,row in enumerate(val) if np.argmax(Y_val_e[idx])==n])
val_in.append(images_class_n)
#print(train_in[0].shape)
return train_in,val_in,Y_train_e,Y_val_e,nb_classes
train_in,val,Y_train,Y_val,nb_classes = load_data()
input_shape = (train_in[0].shape[1],)
def build_network(input_shape , embeddingsize):
'''
Define the neural network to learn image similarity
Input :
input_shape : shape of input images
embeddingsize : vectorsize used to encode our picture
'''
#in_ = Input(train.shape)
net = Sequential()
net.add(Dense(128, activation='relu', input_shape=input_shape))
net.add(Dense(128, activation='relu'))
net.add(Dense(256, activation='relu'))
net.add(Dense(4096, activation='sigmoid'))
net.add(Dense(embeddingsize, activation= None))
#Force the encoding to live on the d-dimentional hypershpere
net.add(Lambda(lambda x: K.l2_normalize(x,axis=-1)))
return net
class TripletLossLayer(Layer):
def __init__(self, alpha, **kwargs):
self.alpha = alpha
super(TripletLossLayer, self).__init__(**kwargs)
def triplet_loss(self, inputs):
anchor, positive, negative = inputs
p_dist = K.sum(K.square(anchor-positive), axis=-1)
n_dist = K.sum(K.square(anchor-negative), axis=-1)
return K.sum(K.maximum(p_dist - n_dist + self.alpha, 0), axis=0)
def call(self, inputs):
loss = self.triplet_loss(inputs)
self.add_loss(loss)
return loss
def build_model(input_shape, network, margin=0.2):
'''
Define the Keras Model for training
Input :
input_shape : shape of input images
network : Neural network to train outputing embeddings
margin : minimal distance between Anchor-Positive and Anchor-Negative for the lossfunction (alpha)
'''
# Define the tensors for the three input images
anchor_input = Input(input_shape, name="anchor_input")
positive_input = Input(input_shape, name="positive_input")
negative_input = Input(input_shape, name="negative_input")
# Generate the encodings (feature vectors) for the three images
encoded_a = network(anchor_input)
encoded_p = network(positive_input)
encoded_n = network(negative_input)
#TripletLoss Layer
loss_layer = TripletLossLayer(alpha=margin,name='triplet_loss_layer')([encoded_a,encoded_p,encoded_n])
# Connect the inputs with the outputs
network_train = Model(inputs=[anchor_input,positive_input,negative_input],outputs=loss_layer)
# return the model
return network_train
def get_batch_random(batch_size,s="train"):
# initialize result
triplets=[np.zeros((batch_size,m)) for i in range(3)]
for i in range(batch_size):
#Pick one random class for anchor
anchor_class = np.random.randint(0, nb_classes)
nb_sample_available_for_class_AP = X[anchor_class].shape[0]
#Pick two different random pics for this class => A and P. You can use same anchor as P if there is one one element for anchor
if nb_sample_available_for_class_AP<=1:
continue
[idx_A,idx_P] = np.random.choice(nb_sample_available_for_class_AP,size=2 ,replace=False)
#Pick another class for N, different from anchor_class
negative_class = (anchor_class + np.random.randint(1,nb_classes)) % nb_classes
nb_sample_available_for_class_N = X[negative_class].shape[0]
#Pick a random pic for this negative class => N
idx_N = np.random.randint(0, nb_sample_available_for_class_N)
triplets[0][i,:] = X[anchor_class][idx_A,:]
triplets[1][i,:] = X[anchor_class][idx_P,:]
triplets[2][i,:] = X[negative_class][idx_N,:]
return np.array(triplets)
def get_batch_hard(draw_batch_size,hard_batchs_size,norm_batchs_size,network,s="train"):
if s == 'train':
X = train_in
else:
X = val
#m, features = X[0].shape
#while True:
#Step 1 : pick a random batch to study
studybatch = get_batch_random(draw_batch_size,X)
#Step 2 : compute the loss with current network : d(A,P)-d(A,N). The alpha parameter here is omited here since we want only to order them
studybatchloss = np.zeros((draw_batch_size))
#Compute embeddings for anchors, positive and negatives
A = network.predict(studybatch[0])
P = network.predict(studybatch[1])
N = network.predict(studybatch[2])
#Compute d(A,P)-d(A,N)
studybatchloss = np.sum(np.square(A-P),axis=1) - np.sum(np.square(A-N),axis=1)
#Sort by distance (high distance first) and take the
selection = np.argsort(studybatchloss)[::-1][:hard_batchs_size]
#Draw other random samples from the batch
selection2 = np.random.choice(np.delete(np.arange(draw_batch_size),selection),norm_batchs_size,replace=False)
selection = np.append(selection,selection2)
triplets = [studybatch[0][selection,:], studybatch[1][selection,:],studybatch[2][selection,:]]
triplets = triplets.reshape(triplets.shape[0],triplets.shape[1],triplets.shape[2])
yield triplets
network = build_network(input_shape,embeddingsize=10)
hard = get_batch_hard(5,4,1,network,s="train")
network_train = build_model(input_shape,network)
optimizer = Adam(lr = 0.00006)
network_train.compile(loss=None,optimizer=optimizer)
#this works
#history = network_train.fit(hard,epochs=100,steps_per_epoch=1, verbose=2)
history = network_train.fit_generator(hard,epochs=10,steps_per_epoch=16, verbose=2)
# error:: the list of Numpy arrays that you are passing to your model is not the size the model
expected. Expected to see 3 array(s), but instead got the following list of 1 arrays:
I think that's beacause in your generator you are yielding the 3 inputs array in one list, you need to yield the 3 arrays independently:
triplet_1 = studybatch[0][selection,:]
triplet_2 = studybatch[1][selection,:]
triplet_3 = studybatch[2][selection,:]
yield [triplet_1, triplet_2, triplet_3]

Apply same Keras Tensorflow preprocessing transform to several images

I'm currently building my own custom generator with data augmentation. I have to apply the same transformation to two images, the train one and the target one, but functions only works one by one. Any ideas how to apply the exact transformation in both images?
This is one function I'm trying:
tf.keras.preprocessing.image.random_zoom
The code for the generator:
def generator(data, batch_size = 32, data_augmentation = False):
batch_number = 0
while True:
original_train = []
original_target = []
train = []
target = []
if batch_number * batch_size >= data.shape[0]:
batch_number = 0
for x in train_pd.values[batch_number * batch_size: (batch_number + 1) * batch_size ]:
train_image = load_image_as_array(train_path+'images/'+x[0]+'.png').reshape(101,101,1)
target_image = load_image_as_array(train_path+'masks/'+x[0]+'.png').reshape(101,101,1)
original_train.append(train_image)
original_target.append(target_image)
if data_augmentation:
train_image = random_zoom(train_image, (.25,.25))
target_image = random_zoom(target_image, (.25,.25))
train.append(train_image)
target.append(target_image)
batch_number += 1
yield np.array(train), np.array(target), np.array(original_train), np.array(original_target)
Ok. It seems that it applies the exact transformation besides you call the function two times.

How to use previous output and hidden states from LSTM for the attention mechanism?

I am currently trying to code the attention mechanism from this paper: "Effective Approaches to Attention-based Neural Machine Translation", Luong, Pham, Manning (2015). (I use global attention with the dot score).
However, I am unsure on how to input the hidden and output states from the lstm decode. The issue is that the input of the lstm decoder at time t depends on quantities that I need to compute using the output and hidden states from t-1.
Here is the relevant part of the code:
with tf.variable_scope('data'):
prob = tf.placeholder_with_default(1.0, shape=())
X_or = tf.placeholder(shape = [batch_size, timesteps_1, num_input], dtype = tf.float32, name = "input")
X = tf.unstack(X_or, timesteps_1, 1)
y = tf.placeholder(shape = [window_size,1], dtype = tf.float32, name = "label_annotation")
logits = tf.zeros((1,1), tf.float32)
with tf.variable_scope('lstm_cell_encoder'):
rnn_layers = [tf.nn.rnn_cell.LSTMCell(size) for size in [hidden_size, hidden_size]]
multi_rnn_cell = tf.nn.rnn_cell.MultiRNNCell(rnn_layers)
lstm_outputs, lstm_state = tf.contrib.rnn.static_rnn(cell=multi_rnn_cell,inputs=X,dtype=tf.float32)
concat_lstm_outputs = tf.stack(tf.squeeze(lstm_outputs))
last_encoder_state = lstm_state[-1]
with tf.variable_scope('lstm_cell_decoder'):
initial_input = tf.unstack(tf.zeros(shape=(1,1,hidden_size2)))
rnn_decoder_cell = tf.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple = True)
# Compute the hidden and output of h_1
for index in range(window_size):
output_decoder, state_decoder = tf.nn.static_rnn(rnn_decoder_cell, initial_input, initial_state=last_encoder_state, dtype=tf.float32)
# Compute the score for source output vector
scores = tf.matmul(concat_lstm_outputs, tf.reshape(output_decoder[-1],(hidden_size,1)))
attention_coef = tf.nn.softmax(scores)
context_vector = tf.reduce_sum(tf.multiply(concat_lstm_outputs, tf.reshape(attention_coef, (window_size, 1))),0)
context_vector = tf.reshape(context_vector, (1,hidden_size))
# compute the tilda hidden state \tilde{h}_t=tanh(W[c_t, h_t]+b_t)
concat_context = tf.concat([context_vector, output_decoder[-1]], axis = 1)
W_tilde = tf.Variable(tf.random_normal(shape = [hidden_size*2, hidden_size2], stddev = 0.1), name = "weights_tilde", trainable = True)
b_tilde = tf.Variable(tf.zeros([1, hidden_size2]), name="bias_tilde", trainable = True)
hidden_tilde = tf.nn.tanh(tf.matmul(concat_context, W_tilde)+b_tilde) # hidden_tilde is [1*64]
# update for next time step
initial_input = tf.unstack(tf.reshape(hidden_tilde, (1,1,hidden_size2)))
last_encoder_state = state_decoder
# predict the target
W_target = tf.Variable(tf.random_normal(shape = [hidden_size2, 1], stddev = 0.1), name = "weights_target", trainable = True)
logit = tf.matmul(hidden_tilde, W_target)
logits = tf.concat([logits, logit], axis = 0)
logits = logits[1:]
The part inside the loop is what I am unsure of. Does tensorflow remember the computational graph when I overwrite the variable "initial_input" and "last_encoder_state"?
I think your model will be much simplified if you use tf.contrib.seq2seq.AttentionWrapper with one of implementations: BahdanauAttention or LuongAttention.
This way it'll be possible to wire the attention vector on a cell level, so that cell output is already after attention applied. Example from the seq2seq tutorial:
cell = LSTMCell(512)
attention_mechanism = tf.contrib.seq2seq.LuongAttention(512, encoder_outputs)
attn_cell = tf.contrib.seq2seq.AttentionWrapper(cell, attention_mechanism, attention_size=256)
Note that this way you won't need a loop of window_size, because tf.nn.static_rnn or tf.nn.dynamic_rnn will instantiate the cells wrapped with attention.
Regarding your question: you should distinguish python variables and tensorflow graph nodes: you can assign last_encoder_state to a different tensor, the original graph node won't change because of this. This is flexible, but can be also misleading in the result network - you might think that you connect an LSTM to one tensor, but it's actually the other. In general, you shouldn't do that.

Can embedding_rnn_seq2seq function return all states?

I am playing Seq2Seq model use embedding_rnn_seq2seq function ,
i read document say embedding_rnn_seq2seq return outputs and state that in each time-step ,
but i try to get state only can get one step
here is my model
seq_length = 100
batch_size = 128
vocab_size = 12
memory_dim = 100
enc_inp = [tf.placeholder(tf.int32, shape=(None,), name="inp%i" % t) for t in range(seq_length)]
labels = [tf.placeholder(tf.int32, shape=(None,), name="labels%i" % t) for t in range(seq_length)]
dec_inp = ([tf.zeros_like(labels[0], dtype=np.int32, name="GO")] + labels[:-1])
weights = [tf.ones_like(labels_t, dtype=tf.float32) for labels_t in labels]
cell = rnn_cell.GRUCell(memory_dim)
dec_outputs, dec_memory = seq2seq.embedding_rnn_seq2seq(enc_inp,dec_inp,cell,vocab_size,vocab_size,vocab_size)
loss = seq2seq.sequence_loss(dec_outputs, labels, weights, vocab_size)
try to get state (dec_memory)
dec_memory_batch = sess.run(dec_memory , feed_dict)
only return a one-step memory_dim size vector , maybe is last step state
So , do anyone have some advice ?