Tensorflow: LSTM with moving average of weights of another LSTM - tensorflow

I would like to have an LSTM in tensorflow whose weights are the exponential moving average of the weights of another LSTM. So basically I have this code with some input placeholder and some initial state placeholder:
def test_lstm(input_ph, init_ph, scope):
cell = tf.nn.rnn_cell.LSTMCell(128, use_peepholes=True)
input_ph = tf.transpose(input_ph, [1, 0, 2])
input_list = tf.unpack(input_ph)
with tf.variable_scope(scope) as vs:
outputs, states = tf.nn.rnn(cell, input_list, initial_state=init_ph)
theta = [v for v in tf.all_variables() if v.name.startswith(vs.name)]
return outputs, theta
lstm_1, theta_1 = test_lstm(input_1, state_init_1, scope="lstm_1")
What I would like to do now is something similar along these lines (which don't actually work because the exponential moving average puts the tag "ema" behind the variable name of the weights and they do not appear in variable scope because they were not created with tf.get_variable):
ema = tf.train.ExponentialMovingAverage(decay=1-self.tau, name="ema")
with tf.variable_scope("lstm_2"):
maintain_averages_theta_1 = ema.apply(theta_1)
theta_2_1 = [ema.average(x) for x in theta_1]
lstm_2 , theta_2_2 = test_lstm(input_2, state_init_2, scope="lstm_2"
where eventually theta_2_1 would be equal to theta_2_2 (or throw an exception because the variables already exist).

Change (3/May/2018): I added one line apart from the old answer. The old one itself is self-sufficient to this question, but it real practice we initialize 'target network' as the same value to the 'behavioural network'. I added this point. Search the line having 'Change0': You need to do this only once right after when the target network's parameters are just initialized. Without 'Change0', you first round of gradient-based update would become nasty since Q_target and Q_behaviour are not correlated while you need both to be somewhat related ex) TD = r + gamma*Q_target - Q_behaviour for the update.
Seems a bit late but hope this helps.
The critical problem of TF-RNN series possess is that we cannot directly designate the variables for RNNs unlike plain feedforward or convolutional NNs, so that we can't do simple work - get EMAed vars and plug into the network.
Let's go into the real deal(I attached a practice code to look-up for this, so please refer EMA_LSTM.py).
Shall we say, there is a network function containing LSTM:
def network(in_x,in_h):
# Input: 'in_x' is the input sequence, 'in_h' is the initial hidden cell(c,m)
# Output: 'hidden_outputs' is the output sequence(c-sequence), 'net' is the list of parameters used in this network function
cell = tf.nn.rnn_cell.BasicLSTMCell(3, state_is_tuple=True)
in_h = tf.nn.rnn_cell.LSTMStateTuple(in_h[0], in_h[1])
hidden_outputs, last_tuple = tf.nn.dynamic_rnn(cell, in_x, dtype=tf.float32, initial_state=in_h)
net = [v for v in tf.trainable_variables() if tf.contrib.framework.get_name_scope() in v.name]
return hidden_outputs, net
Then you declare tf.placeholder for the necessary inputs, which are:
in_x = tf.placeholder("float", [None,None,6])
in_c = tf.placeholder("float", [None,3])
in_m = tf.placeholder("float", [None,3])
in_h = (in_c, in_m)
Lastly, we run a session that proceeds the network() function with specified inputs:
init_cORm = np.zeros(shape=(1,3))
input = np.ones(shape=(1,1,6))
print '========================new-1(beh)=============================='
with tf.Session() as sess:
with tf.variable_scope('beh', reuse=False) as beh:
result, net1 = network(in_x,in_h)
sess.run(tf.global_variables_initializer())
list = sess.run([result, in_x, in_h, net1], feed_dict={
in_x : input,
in_c : init_cORm,
in_m : init_cORm
})
print 'result:', list[0]
print 'in_x:' , list[1]
print 'in_h:', list[2]
print 'net1:', list[3][0][0][:4]
Now, we are going to make the var_list called 'net4' that contains the ExponentialMovingAverage(EMA)-ed values of 'net1', as in below with original 'net1' being first assigned in beh session above and newly assigned in below by adding 1. for each element:
ema = tf.train.ExponentialMovingAverage(decay=0.5)
target_update_op = ema.apply(net1)
init_new_vars_op = tf.initialize_variables(var_list=[v for v in tf.global_variables() if 'ExponentialMovingAverage' in v.name]) # 'initialize_variables' will be replaced with 'variables_initializer' in 2017
sess.run(init_new_vars_op)
sess.run([param4.assign(param1.eval()) for param4, param1 in zip(net4,net1)]) # Change0
len_net1 = len(net1)
net1_ema = [[] for i in range(len_net1)]
for i in range(len_net1):
sess.run(net1[i].assign(1. + net1[i]))
sess.run(target_update_op)
Note that
we only initialised(by declaring 'init_new_vars_op' then running the declaration job) the variables of their name containing 'ExponentialMovingAverage', if not the variables in net1 will also be newly initialised.
'net1' is newly assigned with +1 for every elements of the variables in 'net1'. If an element of 'net1' was -0.5 and is now 0.5 by +1, then we want to have 'net4' as 0. when the EMA decay rate is 0.5
Finally, we run the EMA job with 'sess.run(target_update_op)'
Eventually, we declare 'net4' first with the 'network()' function and then assign&run the EMA(net1) values into 'net4'. When you run 'sess.run(result)', then it will be the one with EMA(net1)-ed variables.
with tf.variable_scope('tare', reuse=False) as tare:
result, net4 = network(in_x,in_h)
len_net4 = len(net4)
target_assign = [[] for i in range(len_net4)]
for i in range(len_net4):
target_assign[i] = net4[i].assign(ema.average(net1[i]))
sess.run(target_assign[i].op)
list = sess.run([result, in_x, in_h, net4], feed_dict={
in_x : input,
in_c : init_cORm,
in_m : init_cORm
})
What's happened in here? You just indirectly declares the LSTM variables as 'net4' in the 'network()' function. Then in the for loop, we points out that 'net4' is actually EMA of 'net1' and 'net1+1.' . Lastly, with the net4 specified what to do with(via 'network()') and what values it takes(via 'for loop of .assign(ema.average()) to itself'), you run the process.
It is somewhat counter-intuitive that we declare the 'result' first and specifies the value of parameters second. However, that's the nature of TF what they exactly look for, as it is always logical to set variables, processes, and their relation first then to assign values, then runs the processes.
Lastly, few things to go further for the real machine learning codes:
In here, I just second assigned 'net1' with 'net1 + 1.'. In real case, this '+1.' step is where you 'sess.run()'(after you 'declare' in somewhere) your optimiser. So every-time after you 'sess.run(optimiser)', 'sess.run(target_update_op)' and then'sess.run(target_assign[i].op)' should follow to update your 'net4' along the EMA of 'net1'. Conceretly, you can do this job with different order like below:
ema = tf.train.ExponentialMovingAverage(decay=0.5)
target_update_op = ema.apply(net1)
with tf.variable_scope('tare', reuse=False) as tare:
result, net4 = network(in_x,in_h)
len_net4 = len(net4)
target_assign = [[] for i in range(len_net4)]
for i in range(len_net4):
target_assign[i] = net4[i].assign(ema.average(net1[i]))
init_new_vars_op = tf.initialize_variables(var_list=[v for v in tf.global_variables() if 'ExponentialMovingAverage' in v.name]) # 'initialize_variables' will be replaced with 'variables_initializer' in 2017
sess.run(init_new_vars_op)
sess.run([param4.assign(param1.eval()) for param4, param1 in zip(net4,net1)]) # Change0
Lastly, be aware that you must sess.run(ema.apply(net)) right after the net changes, and then sess.run(net_ema.assign(ema.average(net)). W/O .apply, net_ema will not get assigned with averaged value
len_net1 = len(net1)
net1_ema = [[] for i in range(len_net1)]
for i in range(len_net1):
sess.run(net1[i].assign(1. + net1[i]))
sess.run(target_update_op)
for i in range(len_net4):
sess.run(target_assign[i].op)
list = sess.run([result, in_x, in_h, net4], feed_dict={
in_x : input,
in_c : init_cORm,
in_m : init_cORm
})

Related

consistent forward / backward pass with tensorflow dropout

For the reinforcement learning one usually applies forward pass of the neural network for each step of the episode in order to calculate policy. Afterwards one could calculate parameter gradients using backpropagation. Simplified implementation of my network looks like this:
class AC_Network(object):
def __init__(self, s_size, a_size, scope, trainer, parameters_net):
with tf.variable_scope(scope):
self.is_training = tf.placeholder(shape=[], dtype=tf.bool)
self.inputs = tf.placeholder(shape=[None, s_size], dtype=tf.float32)
# (...)
layer = slim.fully_connected(self.inputs,
layer_size,
activation_fn=tf.nn.relu,
biases_initializer=None)
layer = tf.contrib.layers.dropout(inputs=layer, keep_prob=parameters_net["dropout_keep_prob"],
is_training=self.is_training)
self.policy = slim.fully_connected(layer, a_size,
activation_fn=tf.nn.softmax,
biases_initializer=None)
self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
self.advantages = tf.placeholder(shape=[None], dtype=tf.float32)
actions_onehot = tf.one_hot(self.actions, a_size, dtype=tf.float32)
responsible_outputs = tf.reduce_sum(self.policy * actions_onehot, [1])
self.policy_loss = - policy_loss_multiplier * tf.reduce_mean(tf.log(responsible_outputs) * self.advantages)
local_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
self.gradients = tf.gradients(self.policy_loss, local_vars)
Now during training I will fist rollout the episode by consecutive forward passes (again, simplified version):
s = self.local_env.reset() # list of input variables for the first step
while done == False:
a_dist = sess.run([self.policy],
feed_dict = {self.local_AC.inputs: [s],
self.is_training: True})
a = np.argmax(a_dist)
s, r, done, extra_stat = self.local_env.step(a)
# (...)
and in the end I will calculate gradients by backward pass:
p_l, grad = sess.run([self.policy_loss,
self.gradients],
feed_dict={self.inputs: np.vstack(comb_observations),
self.is_training: True,
self.actions: np.hstack(comb_actions),})
(please note that I could have made a mistake somewhere above trying to remove as much as possible of the original code irrelevant to the issue in question)
So finally the question: Is there a way of ensuring that all the consecutive calls to the sess.run() will generate the same dropout structure? Ideally I would like to have exactly the same dropout structure within each episode and only change it between episodes. Things seem to work well as they are but I continue to wonder.

RNN Slow-down phenomenon of Tensorflow

I found a peculiar property of lstm cell(not limited to lstm but I only examined with this) of tensorflow which has not been reported as far as I know.
I don't know whether it actually has, so I left this post in SO. Below is a toy code for this problem:
import tensorflow as tf
import numpy as np
import time
def network(input_list):
input,init_hidden_c,init_hidden_m = input_list
cell = tf.nn.rnn_cell.BasicLSTMCell(256, state_is_tuple=True)
init_hidden = tf.nn.rnn_cell.LSTMStateTuple(init_hidden_c, init_hidden_m)
states, hidden_cm = tf.nn.dynamic_rnn(cell, input, dtype=tf.float32, initial_state=init_hidden)
net = [v for v in tf.trainable_variables()]
return states, hidden_cm, net
def action(x, h_c, h_m):
t0 = time.time()
outputs, output_h = sess.run([rnn_states[:,-1:,:], rnn_hidden_cm], feed_dict={
rnn_input:x,
rnn_init_hidden_c: h_c,
rnn_init_hidden_m: h_m
})
dt = time.time() - t0
return outputs, output_h, dt
rnn_input = tf.placeholder("float", [None, None, 512])
rnn_init_hidden_c = tf.placeholder("float", [None,256])
rnn_init_hidden_m = tf.placeholder("float", [None,256])
rnn_input_list = [rnn_input, rnn_init_hidden_c, rnn_init_hidden_m]
rnn_states, rnn_hidden_cm, rnn_net = network(rnn_input_list)
feed_input = np.random.uniform(low=-1.,high=1.,size=(1,1,512))
feed_init_hidden_c = np.zeros(shape=(1,256))
feed_init_hidden_m = np.zeros(shape=(1,256))
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(10000):
_, output_hidden_cm, deltat = action(feed_input, feed_init_hidden_c, feed_init_hidden_m)
if i % 10 == 0:
print 'Running time: ' + str(deltat)
(feed_init_hidden_c, feed_init_hidden_m) = output_hidden_cm
feed_input = np.random.uniform(low=-1.,high=1.,size=(1,1,512))
[Not important]What this code does is to generate an output from 'network()' function containing LSTM where the input's temporal dimension is 1, so output's is also 1, and pull in&out initial state for each step of running.
[Important] Looking the 'sess.run()' part. For some reasons in my real code, I happened to put [:,-1:,:] for 'rnn_states'. What is happening is then the time spent for each 'sess.run()' increases. For some inspection by my own, I found this slow down stems from that [:,-1:,:]. I just wanted to get the output at the last time step. If you do 'outputs, output_h = sess.run([rnn_states, rnn_hidden_cm], feed_dict{~' w/o [:,-1:,:] and take 'last_output = outputs[:,-1:,:]' after the 'sess.run()', then the slow down does not occur.
I do not know why this exponential increment of time happens with that [:,-1:,:] running. Is this the nature of tensorflow hasn't been documented but particularly slows down(may be adding more graph by its own?)?
Thank you, and hope this mistake not happen for other users by this post.
I encountered the same problem, with TensorFlow slowing down for each iteration I ran it, and found this question while trying to debug it. Here's a short description of my situation and how I solved it for future reference. Hopefully it can point someone in the right direction and save them some time.
In my case the problem was mainly that I didn't make use of feed_dict to supply the network state when executing sess.run(). Instead I redeclared outputs, final_state and prediction every iteration. The answer at https://github.com/tensorflow/tensorflow/issues/1439#issuecomment-194405649 made me realize how stupid that was... I was constantly creating new graph nodes in every iteration, making it all slower and slower. The problematic code looked something like this:
# defining the network
lstm_layer = rnn.BasicLSTMCell(num_units, forget_bias=1)
outputs, final_state = rnn.static_rnn(lstm_layer, input, initial_state=rnn_state, dtype='float32')
prediction = tf.nn.softmax(tf.matmul(outputs[-1], out_weights)+out_bias)
for input_data in data_seq:
# redeclaring, stupid stupid...
outputs, final_state = rnn.static_rnn(lstm_layer, input, initial_state=rnn_state, dtype='float32')
prediction = tf.nn.softmax(tf.matmul(outputs[-1], out_weights)+out_bias)
p, rnn_state = sess.run((prediction, final_state), feed_dict={x: input_data})
The solution was of course to only declare the nodes once in the beginning, and supply the new data with feed_dict. The code went from being half slow (> 15 ms in the beginning) and becoming slower for every iteration, to execute every iteration in around 1 ms. My new code looks something like this:
out_weights = tf.Variable(tf.random_normal([num_units, n_classes]), name="out_weights")
out_bias = tf.Variable(tf.random_normal([n_classes]), name="out_bias")
# placeholder for the network state
state_placeholder = tf.placeholder(tf.float32, [2, 1, num_units])
rnn_state = tf.nn.rnn_cell.LSTMStateTuple(state_placeholder[0], state_placeholder[1])
x = tf.placeholder('float', [None, 1, n_input])
input = tf.unstack(x, 1, 1)
# defining the network
lstm_layer = rnn.BasicLSTMCell(num_units, forget_bias=1)
outputs, final_state = rnn.static_rnn(lstm_layer, input, initial_state=rnn_state, dtype='float32')
prediction = tf.nn.softmax(tf.matmul(outputs[-1], out_weights)+out_bias)
# actual network state, which we input with feed_dict
_rnn_state = tf.nn.rnn_cell.LSTMStateTuple(np.zeros((1, num_units), dtype='float32'), np.zeros((1, num_units), dtype='float32'))
it = 0
for input_data in data_seq:
encl_input = [[input_data]]
p, _rnn_state = sess.run((prediction, final_state), feed_dict={x: encl_input, rnn_state: _rnn_state})
print("{} - {}".format(it, p))
it += 1
Moving the declaration out from the for loop also got rid of the problem which the OP sdr2002 had, doing a slice outputs[-1] in sess.run() inside the for loop.
As mentioned above, no sliced output for 'sess.run()' is much appreciated for this case.
def action(x, h_c, h_m):
t0 = time.time()
outputs, output_h = sess.run([rnn_states, rnn_hidden_cm], feed_dict={
rnn_input:x,
rnn_init_hidden_c: h_c,
rnn_init_hidden_m: h_m
})
outputs = outputs[:,-1:,:]
dt = time.time() - t0
return outputs, output_h, dt

Tensorflow - running total

How can I add the number 5 after every iteration of the loop?
I want to do something like this:
weight = 0.225
for i in range(10):
weight += 5
print (weight)
Here is how I am trying in tensorflow but it never updates the weight
import tensorflow as tf
def dummy(x):
weights['h0'] = tf.add(weights['h0'], 5)
res = tf.add(weights['h0'], x)
return res
# build computational graph
a = tf.placeholder('float', None)
d = dummy(a)
weights = {
'h0': tf.Variable(tf.random_normal([1]))
}
# initialize variables
init = tf.global_variables_initializer()
# create session and run the graph
with tf.Session() as sess:
sess.run(init)
for i in range(10):
print (sess.run(d, feed_dict={a: [2]}))
# close session
sess.close()
There's an operation explicitly created for adding a value and assigning the result back to the input node: tf.assign_add
You should use it instead of tf.assing + tf.add.
Also, it's more important that you understand why you previous code won't work.
weights['h0'] = tf.add(weights['h0'], 5)
res = tf.add(weights['h0'], x)
At the fist line, you're defining a node add, whose inputs are weights['h0'] and 5 and you're assigning this node to a python variable weights['h0'].
Now, thus, weights['h0'] is a python variable holding a tensorflow node.
In the next line, you're defining another add node, between the previous node and x, and you return this node.
When the graph is evaluated, you evaluate the node pointed by res, that force the evaluation of the previous node (because res is a function of the node holded by weights['h0']).
The problem is the that your assignment at line 1 is a python assignment and not a tensorflow assignment.
Thus that assign operation is executed only in the python environment but it has no defined an assign node into the tensorflow graph.
P.S: when you use with you're defining a context manager that handles the closing operations for you. You can thus remove sess.close() because is executed automatically when you exit from that context
Apparently there is an assign operator
https://www.tensorflow.org/api_docs/python/tf/assign
weights['h0'] = tf.assign(weights['h0'], tf.add(weights['h0'], 5))

How to use maxout activation function in tensorflow?

I want to use maxout activation function in tensorflow, but I don't know which function should use.
I sent a pull request for maxout, here is the link:
https://github.com/tensorflow/tensorflow/pull/5528
Code is as follows:
def maxout(inputs, num_units, axis=None):
shape = inputs.get_shape().as_list()
if axis is None:
# Assume that channel is the last dimension
axis = -1
num_channels = shape[axis]
if num_channels % num_units:
raise ValueError('number of features({}) is not a multiple of num_units({})'
.format(num_channels, num_units))
shape[axis] = -1
shape += [num_channels // num_units]
outputs = tf.reduce_max(tf.reshape(inputs, shape), -1, keep_dims=False)
return outputs
Here is how it works:
I don't think there is a maxout activation but there is nothing stopping yourself from making it yourself. You could do something like the following.
with tf.variable_scope('maxout'):
layer_input = ...
layer_output = None
for i in range(n_maxouts):
W = tf.get_variable('W_%d' % d, (n_input, n_output))
b = tf.get_variable('b_%d' % i, (n_output,))
y = tf.matmul(layer_input, W) + b
if layer_output is None:
layer_output = y
else:
layer_output = tf.maximum(layer_output, y)
Note that this is code I just wrote in my browser so there may be syntax errors but you should get the general idea. You simply perform a number of linear transforms and take the maximum across all the transforms.
How about this code?
This seems to work in my test.
def max_out(input_tensor,output_size):
shape = input_tensor.get_shape().as_list()
if shape[1] % output_size == 0:
return tf.transpose(tf.reduce_max(tf.split(input_tensor,output_size,1),axis=2))
else:
raise ValueError("Output size or input tensor size is not fine. Please check it. Reminder need be zero.")
I refer the diagram in the following page.
From version 1.4 on you can use tf.contrib.layers.maxout.
Maxout is a layer such that it calculates N*M output for a N*1 input, and then it returns the maximum value across the column, i.e., the final output has shape N*1 as well. Basically it uses multiple linear fittings to mimic a complex function.

Using TensorArrays in the context of a while_loop to accumulate values

Below I have an implementation of a Tensorflow RNN Cell, designed to emulate Alex Graves' algorithm ACT in this paper: http://arxiv.org/abs/1603.08983.
At a single timestep in the sequence called via rnn.rnn(with a static sequence_length parameter, so the rnn is unrolled dynamically - I am using a fixed batch size of 20), we recursively call ACTStep, producing outputs of size(1,200) where the hidden dimension of the RNN cell is 200 and we have a batch size of 1.
Using the while loop in Tensorflow, we iterate until the accumulated halting probability is high enough. All of this works reasonably smoothly, but I am having problems accumulating states, probabilities and outputs within the while loop, which we need to do in order to create weighted combinations of these as the final cell output/state.
I have tried using a simple list, as below, but this fails when the graph is compiled as the outputs are not in the same frame(is it possible to use the "switch" function in control_flow_ops to forward the tensors to the point at which they are required, ie the add_n function just before we return the values?). I have also tried using the TensorArray structure, but I am finding this difficult to use as it seems to destroy shape information and replacing it manually hasn't worked. I also haven't been able to find much documentation on TensorArrays, presumably as they are, I imagine, mainly for internal TF use.
Any advice on how it might be possible to correctly accumulate the variables produced by ACTStep would be much appreciated.
class ACTCell(RNNCell):
"""An RNN cell implementing Graves' Adaptive Computation time algorithm"""
def __init__(self, num_units, cell, epsilon, max_computation):
self.one_minus_eps = tf.constant(1.0 - epsilon)
self._num_units = num_units
self.cell = cell
self.N = tf.constant(max_computation)
#property
def input_size(self):
return self._num_units
#property
def output_size(self):
return self._num_units
#property
def state_size(self):
return self._num_units
def __call__(self, inputs, state, scope=None):
with vs.variable_scope(scope or type(self).__name__):
# define within cell constants/ counters used to control while loop
prob = tf.get_variable("prob", [], tf.float32,tf.constant_initializer(0.0))
counter = tf.get_variable("counter", [],tf.float32,tf.constant_initializer(0.0))
tf.assign(prob,0.0)
tf.assign(counter, 0.0)
# the predicate for stopping the while loop. Tensorflow demands that we have
# all of the variables used in the while loop in the predicate.
pred = lambda prob,counter,state,input,\
acc_state,acc_output,acc_probs:\
tf.logical_and(tf.less(prob,self.one_minus_eps), tf.less(counter,self.N))
acc_probs = []
acc_outputs = []
acc_states = []
_,iterations,_,_,acc_states,acc_output,acc_probs = \
control_flow_ops.while_loop(pred,
self.ACTStep,
[prob,counter,state,input,acc_states,acc_outputs,acc_probs])
# TODO:fix last part of this, need to use the remainder.
# TODO: find a way to accumulate the regulariser
# here we take a weighted combination of the states and outputs
# to use as the actual output and state which is passed to the next timestep.
next_state = tf.add_n([tf.mul(x,y) for x,y in zip(acc_probs,acc_states)])
output = tf.add_n([tf.mul(x,y) for x,y in zip(acc_probs,acc_outputs)])
return output, next_state
def ACTStep(self,prob,counter,state,input, acc_states,acc_outputs,acc_probs):
output, new_state = rnn.rnn(self.cell, [input], state, scope=type(self.cell).__name__)
prob_w = tf.get_variable("prob_w", [self.cell.input_size,1])
prob_b = tf.get_variable("prob_b", [1])
p = tf.nn.sigmoid(tf.matmul(prob_w,new_state) + prob_b)
acc_states.append(new_state)
acc_outputs.append(output)
acc_probs.append(p)
return [tf.add(prob,p),tf.add(counter,1.0),new_state, input,acc_states,acc_outputs,acc_probs]
I'm going to preface this response that this is NOT a complete solution, but rather some commentary on how to improve your cell.
To start off, in your ACTStep function, you call rnn.rnn for one timestep (as defined by [input]. If you're doing a single timestep, it is probably more efficient to simple use the actual self.cell call function. You'll see this same mechanism used in tensorflow rnncell wrappers
You mentioned that you have tried using TensorArrays. Did you pack and unpack the tensorarrays appropriately? Here is a repo where you'll find under model.py the tensorarrays are packed and unpacked properly.
You also asked if there is a function in control_flow_ops that will require all the tensors to be accumulated. I think you are looking for tf.control_dependencies
You can list all of your output tensors operations in control_dependicies and that will require tensorflow to compute all tensors up into that point.
Also, it looks like your counter variable is trainable. Are you sure you want this to be the case? If you're adding plus one to your counter, that probably wouldn't yield the correct result. On the other hand, you could have purposely kept it trainable to differentiate it at the end for the ponder cost function.
Also I believe the Remainder function should be in your script:
remainder = 1.0 - tf.add_n(acc_probs[:-1])
#note that there is a -1 in the list as you do not want to grab the last probability
Here is my version of your code edited:
class ACTCell(RNNCell):
"""An RNN cell implementing Graves' Adaptive Computation time algorithm
Notes: https://www.evernote.com/shard/s189/sh/fd165646-b630-48b7-844c-86ad2f07fcda/c9ab960af967ef847097f21d94b0bff7
"""
def __init__(self, num_units, cell, max_computation = 5.0, epsilon = 0.01):
self.one_minus_eps = tf.constant(1.0 - epsilon) #episolon is 0.01 as found in the paper
self._num_units = num_units
self.cell = cell
self.N = tf.constant(max_computation)
#property
def input_size(self):
return self._num_units
#property
def output_size(self):
return self._num_units
#property
def state_size(self):
return self._num_units
def __call__(self, inputs, state, scope=None):
with vs.variable_scope(scope or type(self).__name__):
# define within cell constants/ counters used to control while loop
prob = tf.constant(0.0, shape = [batch_size])
counter = tf.constant(0.0, shape = [batch_size])
# the predicate for stopping the while loop. Tensorflow demands that we have
# all of the variables used in the while loop in the predicate.
pred = lambda prob,counter,state,input,acc_states,acc_output,acc_probs:\
tf.logical_and(tf.less(prob,self.one_minus_eps), tf.less(counter,self.N))
acc_probs, acc_outputs, acc_states = [], [], []
_,iterations,_,_,acc_states,acc_output,acc_probs = \
control_flow_ops.while_loop(
pred,
self.ACTStep, #looks like he purposely makes the while loop here
[prob, counter, state, input, acc_states, acc_outputs, acc_probs])
'''mean-field updates for states and outputs'''
next_state = tf.add_n([x*y for x,y in zip(acc_probs,acc_states)])
output = tf.add_n([x*y for x,y in zip(acc_probs,acc_outputs)])
remainder = 1.0 - tf.add_n(acc_probs[:-1]) #you take the last off to avoid a negative ponder cost #the problem here is we need to take the sum of all the remainders
tf.add_to_collection("ACT_remainder", remainder) #if this doesnt work then you can do self.list based upon timesteps
tf.add_to_collection("ACT_iterations", iterations)
return output, next_state
def ACTStep(self,prob, counter, state, input, acc_states, acc_outputs, acc_probs):
'''run rnn once'''
output, new_state = rnn.rnn(self.cell, [input], state, scope=type(self.cell).__name__)
prob_w = tf.get_variable("prob_w", [self.cell.input_size,1])
prob_b = tf.get_variable("prob_b", [1])
halting_probability = tf.nn.sigmoid(tf.matmul(prob_w,new_state) + prob_b)
acc_states.append(new_state)
acc_outputs.append(output)
acc_probs.append(halting_probability)
return [p + prob, counter + 1.0, new_state, input,acc_states,acc_outputs,acc_probs]
def PonderCostFunction(self, time_penalty = 0.01):
'''
note: ponder is completely different than probability and ponder = roe
the ponder cost function prohibits the rnn from cycling endlessly on each timestep when not much is needed
'''
n_iterations = tf.get_collection_ref("ACT_iterations")
remainder = tf.get_collection_ref("ACT_remainder")
return tf.reduce_sum(n_iterations + remainder) #completely different from probability
This is a complicated paper to implement that I have been working on myself. I wouldn't mind collaborating with you to get it done in Tensorflow. If you're interested, please add me at LeavesBreathe on Skype and we can go from there.