I am trying to build a model which is basically sequence to sequence model but i have a special encoder namely "Secondary Encoder".
Timesteps in Secondary Encoder = 300
this encoder has a special property, in essence it is a GRU, but at each timestep the hidden state produced by the GRUCell is needed to be altered, it is needed to be Added with another variable and then this combination(new hidden state) is passed on to the next GRUCell which uses this as initial_state........this thing repeated 300 times.
As 300 GRUCells are required (one for each time step) it is not feasible to hard code each of the 300 layers and create the model.
So, I need help to figure out how to write a loop to implement this thing in keras or maybe how to create a custom Layer (if this is a better choice).
what I thought (pseudocode) :-
here alpha is the variable that I was talking that i want to add
x = Input(shape=...)
encoder_cell = GRU(10,return_state=True)
init_state = xxxx //some value to give as initialiser to first GRU cell
for t in range(300):
_,hstate = encoder_cell(x[t],initial_state = init_state)
init_state = hstate + alpha
model = Model(inputs = x, outputs = init_state)
will this work ? will the model be able to interpret that it needs to loop 300 times at each training example?
The model is quite big it has skip connections and lots of other things that's why i need your help to figure out this subset of my problem before i implement the rest, and please ignore the syntax, this is just pseudocode.
Also, I need to call this model again n again, so i think the iterative way will slow down the process by quite a lot right?
Related
I'm trying to implement the recurrent neural network architecture proposed in this paper (https://arxiv.org/abs/1611.03824), where the authors use a LSTM to minimize a black-box function (which however is assumed to be differentiable). Here is a diagram of the proposed architecture: RNN. Briefly, the idea is to use an LSTM like an optimizer, which has to learn a good heuristic to propose new parameters for the unknown function y=f(parameters), so that it moves towards a minimum. Here's how the proposed procedure works:
Select an initial value for the parameters p0, and for the function y0 = f(p0)
Call to LSTM cell with input=[p0,y0], and whose output is a new value for the parameters output=p1
Evaluate y1 = f(p1)
Call the LSTM cell with input=[p1,y1], and obtain output=p2
Evaluate y2 = f(p2)
Repeat for few times, for example stopping at fifth iteration: y5 = f(p5).
I'm trying to implement a similar model in Tensorflow/Keras but I'm having some troubles. In particular, this case is different from "standard" ones because we don't have a predefinite time sequence to be analyzed, but instead it is generated online, after each iteration of the LSTM cell. Thus, in this case, our input would consist of just the starting guess [p0,y0=f(p0)] at time t=0. If I understood it correctly, this model is similar to the one-to-many LSTM, but with the difference that the input to the next time step does not come from just the previous cell, but also form the output an additional function (in our case f).
I managed to create a custom tf.keras.layers.Layer which performs the calculation for a single time step (that is it performs the LSTM cell and then use its output as input to the function f):
class my_layer(tf.keras.layers.Layer):
def __init__(self, units = 4):
super(my_layer, self).__init__()
self.cell = tf.keras.layers.LSTMCell(units)
def call(self, inputs):
prev_cost = inputs[0]
prev_params = inputs[1]
prev_h = inputs[2]
prev_c = inputs[3]
# Concatenate the previous parameters and previous cost to create new input
new_input = tf.keras.layers.concatenate([prev_cost, prev_params])
# New parameters obtained by the LSTM cell, along with new internsal states: h and c
new_params, [new_h, new_c] = self.cell(new_input, states = [prev_h, prev_c])
# Function evaluation
new_cost = f(new_params)
return [new_cost, new_params, new_h, new_c]
but I do not know how to build the recurrent part. I tried to do it manually, that is doing something like:
my_cell = my_layer(units = 4)
outputs = my_cell(inputs)
outputs1 = my_cell(outputs)
outputs2 = my_cell(outputs1)
Is that correct? Is there some other way to do it more appropriately?
Bonus question: I would like to train the LSTM to be able to optimize not only a single function f, but rather a class of different functions [f1, f2, ...] which share some common structure which make them similar enough to be optimized using the same LSTM. How could I implement such a training loop which takes as inputs a list of this functions [f1, f2, ...], and tries to minimize them all? My first thought was to do that "brute force" way: use a for loop over the function and a tf.GradientTape which evaluates and applies the gradients for each function.
Any help is much appreciated!
Thank you very much in advance! :)
In my TensorFlow code I want my network to take inputs from one of the two StagingArea objects depending upon whether I want to do training or testing.
A part of the graph construction code I wrote is as follows :
with tf.device("/gpu:0"):
for i in range(numgpus):
with tf.variable_scope(tf.get_variable_scope(), reuse=i>0) as vscope:
with tf.device('/gpu:{}'.format(i)):
with tf.name_scope('GPU-Tower-{}'.format(i)) as scope:
phase = tf.get_variable("phase", [], initializer=tf.zeros_initializer(),dtype=tf.uint8, trainable=False)
phaseassigntest = phase.assign(1)
phaseassigntrain = phase.assign(0)
phasetest = tf.equal(phase, 0)
is_training = tf.cond(phasetest, lambda: tf.constant(True), lambda: tf.constant(False))
trainstagingarea = tf.contrib.staging.StagingArea([tf.float32, tf.int32], shapes=[[trainbatchsize, 3, 221, 221], [trainbatchsize]], capacity=20)
putoptrain = trainstagingarea.put(train_iterator.get_next())
trainputop.append(putoptrain)
getoptrain = trainstagingarea.get()
traingetop.append(getoptrain)
trainclearop = trainstagingarea.clear()
trainstageclear.append(trainclearop)
trainsizeop = trainstagingarea.size()
trainstagesize.append(trainsizeop)
valstagingarea = tf.contrib.staging.StagingArea([tf.float32, tf.int32], shapes=[[valbatchsize, 3, 221, 221], [valbatchsize]], capacity=20)
putopval = valstagingarea.put(val_iterator.get_next())
valputop.append(putopval)
getopval = valstagingarea.get()
valgetop.append(getopval)
valclearop = valstagingarea.clear()
valstageclear.append(valclearop)
valsizeop = valstagingarea.size()
valstagesize.append(valsizeop)
#elem = valgetop[i]
elem = tf.cond(is_training,lambda: traingetop[i],lambda: valgetop[i])
img = elem[0]
label = elem[1]
labelonehot = tf.one_hot(label, depth=numclasses)
net, networksummaries = overfeataccurate(img,numclasses=numclasses, phase=is_training)
I have used tf.cond to make sure that the network is fed by one of the two StagingArea objects. One is meant for training and the other one is meant for validation.
Now, when I try to execute the graph as follows, I do not get any result and infact the code just hangs and I have to kill the process.
with tf.Session(graph=g,config=config) as sess:
sess.run(init_op)
sess.run(tf.local_variables_initializer())
sess.run(val_initialize)
for i in range(20):
sess.run(valputop)
print(sess.run(valstagesize))
writer = tf.summary.FileWriter('.', graph=tf.get_default_graph())
epoch = 0
iter = 0
print("Performing Validation")
sess.run(phaseassigntest)
saver = tf.train.Saver()
while(epoch<10):
time_init = time.time()
while True:
try:
[val_accu, _, summaries] = sess.run([towervalidation, towervalidationupdateop,validation_summary_op])
print(val_accu)
when instead of tf.cond() I directly assign elem = valgetop[i], the code works just fine.
Am I missing something over here ?
What is the right way to feed my network based on whether I want to do training or testing ?
NOTE The error does not go away even if I set numgpus to 1.
Your problem
What you think tf.cond does
Based on the flag, execute what is required to put either traingetop[i] or valgetop[i] into your elem tensor.
What tf.cond actually does
Executes what is required to get both traingetop[i] and valgetop[i], then passes one of them into your elem tensor.
So
The reason it is hanging forever is because it's waiting for an element to be added to your training staging area (so that it can get that element and discard it). You're forgiven for not realising this is what it's doing; it's actually very counter-intuitive. The documentation is awfully unclear on how to deal with this.
Recommended Solution (by Tensorflow documentation)
If you really need the queues to be in the same graph, then you need to make two copies of your ENTIRE graph, one that is fed by your training staging area, and one that is fed by your validation staging area. Then you just use the relevant tensor in your sess.run call. I recommend creating a function that takes a queue output tensor, and returns a model_output tensor. Now you have a train_time_output tensor and a validation_time_output tensor, and you can choose which one you want to execute in your sess.run.
Warning
You need to make sure that you aren't actually creating new variables to go along with these new ops. To do that take a look at the latest documentation on variables. It looks like they've simplified it from v0.12, and it essentially boils down to using tf.get_variable instead of tf.Variable to create your variables.
My preferred work around
Although that is the recommended solution (AFAIK), it is extremely unsatisfying to me; you're creating a whole other set of operations on the graph that just happen to use the same weights. It seems like there's a lot of potential for programmer error by abusing the separation between train time and test/validation time (resulting in the model acting unexpectedly different at these times). Worse; it doesn't solve the problem of tf.cond demanding the values for inputs to both branches, it just forces you to copy your whole graph, which is not always possible.
I prefer to just not have my queues in the graph like that, and treat the model as a function which can be fed an example without caring where it's from. That is, I would instantiate the model with a tf.placeholder as the input, and at execution time I would use feed_dict to actually provide the value. It would function something like this
#inside main training loop
if time_to_train:
example = sess.run(traingettop)
else:
example = sess.run(valgettop)
result = sess.run(model_output, {input_placeholder: example})
It's very useful to note that you can use the feed_dict to feed any value for any tensor anywhere in your model. So, you can change any model definition that, due to tf.cond would always require an input, like:
a = tf.constant(some_value)
b = tf.placeholder(tf.float32)
flag = tf.placeholder(tf.bool, [])
one_of_them = tf.cond(flag, a, b)
model_output = build_graph(one_of_them)
Into a definition that doesn't, like:
a = tf.constant(some_value)
model_output = build_graph(a)
Remembering that you can always overwrite what a is at execution time:
# In main training loop,
sess.run(train_op, {a: some_other_value})
This essentially pushes the conditional into native python land. In your code you might end up with something like:
if condition_satisfied:
sess.run(train_op, {a:some_other_value})
else:
sess.run(train_op)
Performance concerns
If you are using tensorflow on a single machine, then there is practically no performance cost to this solution, as the numpy array/s put into the example python variable are actually still stored on the GPU.
If you are using tensorflow in a distributed fashion, then this solution would kill your performance; it would require sending the example from whatever machine it's on to the master so that it can send it back.
Context:
I have a recurrent neural network with LSTM cells
The input to the network is a batch of size (batch_size, number_of_timesteps, one_hot_encoded_class) in my case (128, 300, 38)
The different rows of the batch (1-128) are not necessarily related
to each other
The target for one time step is given by the value of the next
time step.
My questions:
When I train the network using an input batch of (128,300,38) and a target batch of the same size,
does the network always consider only the last time-step t to predict the value of the next timestep t+1?
or does it consider all time steps from the beginning of the sequence up to time step t?
or does the LSTM cell internally remember all previous states?
I am confused about the functioning because the network is trained on multiple time steps simulatenously so I am not sure how the LSTM cell can still have knowledge of the previous states.
I hope somebody can help. Thanks in advance!
Code for dicussion:
cells = []
for i in range(self.n_layers):
cell = tf.contrib.rnn.LSTMCell(self.n_hidden)
cells.append(cell)
cell = tf.contrib.rnn.MultiRNNCell(cells)
init_state = cell.zero_state(self.batch_size, tf.float32)
outputs, final_state = tf.nn.dynamic_rnn(
cell, inputs=self.inputs, initial_state=init_state)
self.logits = tf.contrib.layers.linear(outputs, self.num_classes)
softmax_ce = tf.nn.sparse_softmax_cross_entropy_with_logits(
labels=labels, logits=self.logits)
self.loss = tf.reduce_mean(softmax_ce)
self.train_step = tf.train.AdamOptimizer(self.lr).minimize(self.loss)
The above is a simple RNN unrolled to the neuron level with 3 time steps.
As you can see that the output at time step t, depends upon all time steps from the beginning. The network is trained using back-propagation through time where the weights are updated by the contribution of all error gradients across time. The weights are shared across time, so there is nothing like simultaneous update on all time steps.
The knowledge of the previous states are transfered through the state variable s_t as it is a function of previous inputs. So at any time step, the prediction is made based on the current input as well as (function of) previous inputs captured by the state variable.
NOTE: A basic rnn was used instead of LSTM because of simplicity.
Here's what would be helpful to keep in mind for your case specifically:
Given the input shape of [128, 300, 38]
One call to dynamic_rnn will propagate through all 300 steps, and if you are using something like LSTM, the state will also be carried through those 300 steps
However, each SUBSEQUENT call to dynamic_rnn will not automatically remember the state from the previous call. By the second call, the weights/etc. will have been updated thanks to the first call, but you will still need to pass the state that resulted from the first call into the second call. That's why dynamic_rnn has a parameter initial_state and that's why one of its outputs is final_state (i.e. the state after processing all 300 steps in ONE call). So you are meant to take the final state from call N and pass it back as the initial state for call N+1 to dynamic_rnn. This allrelates specifically to LSTM, since this is what you asked for
You are right to note that elements in one batch don't necessarily relate to each other within the same batch. This is something you need to consider carefully. Because with successive calls to dynamic_rnn, batch elements in your input sequences have to relate to their respective counterparts in the previous/following sequence, but not to each other. I.e. element 3 in the first call may have nothing to do with the other 127 elements within the same batch, but element 3 in the NEXT call has to be the temporal/logical continuation of element 3 in the PREVIOUS call, and so forth. This way, the state that you keep passing forward makes sense continuously
I am trying to implemente a Memory-augmented neural network, in which the memory and the read/write/usage weight vectors are updated according to a combination of their previous values. These weigths are different from the classic weight matrices between layers that are automatically updated with the fit() function! My problem is the following: how can I correctly initialize these weights as keras tensors and use them in the model? I explain it better with the following simplified example.
My API model is something like:
input = Input(shape=(5,6))
controller = LSTM(20, activation='tanh',stateful=False, return_sequences=True)(input)
write_key = Dense(4,activation='tanh')(controller)
read_key = Dense(4,activation='tanh')(controller)
w_w = Add()([w_u, w_r]) #<---- UPDATE OF WRITE WEIGHTS
to_write = Dot()([w_w, write_key])
M = Add()([M,to_write])
cos_sim = Dot()([M,read_key])
w_r = Lambda(lambda x: softmax(x,axis=1))(cos_sim) #<---- UPDATE OF READ WEIGHTS
w_u = Add()([w_u,w_r,w_w]) #<---- UPDATE OF USAGE WEIGHTS
retrieved_memory = Dot()([w_r,M])
controller_output = concatenate([controller,retrieved_memory])
final_output = Dense(6,activation='sigmoid')(controller_output)`
You can see that, in order to compute w_w^t, I have to have first defined w_r^{t-1} and w_u^{t-1}. So, at the beginning I have to provide a valid initialization for these vectors. What is the best way to do it? The initializations I would like to have are:
M = K.variable(numpy.zeros((10,4))) # MEMORY
w_r = K.variable(numpy.zeros((1,10))) # READ WEIGHTS
w_u = K.variable(numpy.zeros((1,10))) # USAGE WEIGHTS`
But, analogously to what said in #2486(entron), these commands do not return a keras tensor with all the needed meta-data and so this returns the following error:
AttributeError: 'NoneType' object has no attribute 'inbound_nodes'
I also thought to use the old M, w_r and w_u as further inputs at each iteration and analogously get in output the same variables to complete the loop. But this means that I have to use the fit() function to train online the model having just the target as final output (Model 1), and employ the predict() function on the model with all the secondary outputs (Model 2) to get the variables to use at the next iteration. I have also to pass the weigth matrices from Model 1 to Model 2 using get_weights() and set_weights(). As you can see, it becomes a little bit messy and too slow.
Do you have any suggestions for this problem?
P.S. Please, do not focus too much on the API model above because it is a simplified (almost meaningless) version of the complete one where I skipped several key steps.
I have been working with the "dynamic_rnn" to create a model.
The model is based upon a 80 time period signal, and I want to zero the "initial_state" before each run so I have setup the following code fragment to accomplish this:
state = cell_L1.zero_state(self.BatchSize,Xinputs.dtype)
outputs, outState = rnn.dynamic_rnn(cell_L1,Xinputs,initial_state=state, dtype=tf.float32)
This works great for the training process. The problem is once I go to the inference, where my BatchSize = 1, I get an error back as the rnn "state" doesn't match the new Xinputs shape. So what I figured is I need to make "self.BatchSize" based upon the input batch size rather than hard code it. I tried many different approaches, and none of them have worked. I would rather not pass a bunch of zeros through the feed_dict as it is a constant based upon the batch size.
Here are some of my attempts. They all generally fail since the input size is unknown upon building the graph:
state = cell_L1.zero_state(Xinputs.get_shape()[0],Xinputs.dtype)
.....
state = tf.zeros([Xinputs.get_shape()[0], self.state_size], Xinputs.dtype, name="RnnInitializer")
Another approach, thinking the initializer might not get called until run-time, but still failed at graph build:
init = lambda shape, dtype: np.zeros(*shape)
state = tf.get_variable("state", shape=[Xinputs.get_shape()[0], self.state_size],initializer=init)
Is there a way to get this constant initial state to be created dynamically or do I need to reset it through the feed_dict with tensor-serving code? Is there a clever way to do this only once within the graph maybe with an tf.Variable.assign?
The solution to the problem was how to obtain the "batch_size" such that the variable is not hard coded.
This was the correct approach from the given example:
Xinputs = tf.placeholder(tf.int32, (None, self.sequence_size, self.num_params), name="input")
state = cell_L1.zero_state(Xinputs.get_shape()[0],Xinputs.dtype)
The problem is the use of "get_shape()[0]", this returns the "shape" of the tensor and takes the batch_size value at [0]. The documentation doesn't seem to be that clear, but this appears to be a constant value so when you load the graph into an inference, this value is still hard coded (maybe only evaluated at graph creation?).
Using the "tf.shape()" function, seems to do the trick. This doesn't return the shape, but a tensor. So this seems to be updated more at run-time. Using this code fragment solved the problem of a training batch of 128 and then loading the graph into TensorFlow-Service inference handling a batch of just 1.
Xinputs = tf.placeholder(tf.int32, (None, self.sequence_size, self.num_params), name="input")
batch_size = tf.shape(Xinputs)[0]
state = self.cell_L1.zero_state(batch_size,Xinputs.dtype)
Here is a good link to TensorFlow FAQ which describes this approach 'How do I build a graph that works with variable batch sizes?':
https://www.tensorflow.org/resources/faq