Confused by the behavior of `tf.cond` - tensorflow

I need conditional control flow in my graph. If pred is True, the graph should call an op that updates a variable and then returns it; otherwise, it should return the variable unchanged. A simplified version is:
import tensorflow as tf

pred = tf.constant(True)
x = tf.Variable([1])
assign_x_2 = tf.assign(x, [2])

def update_x_2():
    with tf.control_dependencies([assign_x_2]):
        return tf.identity(x)

y = tf.cond(pred, update_x_2, lambda: tf.identity(x))
with tf.Session() as session:
    session.run(tf.initialize_all_variables())
    print(y.eval())
However, I find that both pred=True and pred=False lead to the same result y=[2], which means the assign op is also being run when update_x_2 is not selected by tf.cond. How can this be explained, and how can I fix it?

TL;DR: If you want tf.cond() to perform a side effect (like an assignment) in one of the branches, you must create the op that performs the side effect inside the function that you pass to tf.cond().
The behavior of tf.cond() is a little unintuitive. Because execution in a TensorFlow graph flows forward through the graph, all operations that you refer to in either branch must execute before the conditional is evaluated. This means that both the true and the false branches receive a control dependency on the tf.assign() op, and so y always gets set to 2, even if pred is False.
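To see the failure mode concretely, here is the question's code with pred hard-wired to False (TF 1.x): the false branch is selected, yet the side effect still fires, because assign_x_2 was created outside both branch functions:

import tensorflow as tf

pred = tf.constant(False)       # the false branch will be selected
x = tf.Variable([1])
assign_x_2 = tf.assign(x, [2])  # created OUTSIDE both branch functions

def update_x_2():
    with tf.control_dependencies([assign_x_2]):
        return tf.identity(x)

y = tf.cond(pred, update_x_2, lambda: tf.identity(x))
with tf.Session() as session:
    session.run(tf.initialize_all_variables())
    print(y.eval())  # ==> [2]: the assign ran even though pred is False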
The solution is to create the tf.assign() op inside the function that defines the true branch. For example, you could structure your code as follows:
pred = tf.placeholder(tf.bool, shape=[])
x = tf.Variable([1])

def update_x_2():
    with tf.control_dependencies([tf.assign(x, [2])]):
        return tf.identity(x)

y = tf.cond(pred, update_x_2, lambda: tf.identity(x))
with tf.Session() as session:
    session.run(tf.initialize_all_variables())
    print(y.eval(feed_dict={pred: False}))  # ==> [1]
    print(y.eval(feed_dict={pred: True}))   # ==> [2]

pred = tf.constant(False)
x = tf.Variable([1])

def update_x_2():
    assign_x_2 = tf.assign(x, [2])
    with tf.control_dependencies([assign_x_2]):
        return tf.identity(x)

y = tf.cond(pred, update_x_2, lambda: tf.identity(x))
with tf.Session() as session:
    session.run(tf.initialize_all_variables())
    print(y.eval())
This prints the result [1].
This answer is essentially the same as the one above, but the point I want to share is that you can put every op you need inside its branch function. Given your example code, the tensor x can be used directly by the update_x_2 function.

Related

How to use an optimizer within a forward pass in PyTorch

I want to use an optimizer within the forward pass of a custom-defined Function, but it doesn't work. My code is as follows:
import torch
from torch.autograd import Function

class MyFct(Function):
    @staticmethod
    def forward(ctx, *args):
        input, weight, bias = args[0], args[1], args[2]
        y = torch.tensor([[0]], dtype=torch.float, requires_grad=True)  # initial guess
        loss_fn = lambda y_star: (input + weight - y_star)**2
        learning_rate = 1e-4
        optimizer = torch.optim.Adam([y], lr=learning_rate)
        for t in range(5000):
            y_star = y
            print(y_star)
            loss = loss_fn(y_star)
            if t % 100 == 99:
                print(t, loss.item())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return y_star
And these are my test inputs:
x = torch.tensor([[2]], dtype=torch.float, requires_grad=True)
w = torch.tensor([[2]], dtype=torch.float, requires_grad=True)
y = torch.tensor([[6]], dtype=torch.float)
fct = MyFct.apply
y_hat = fct(x, w, None)
I always get the RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn.
Also, I've tested the optimization outside of the forward and it works, so I guess it's something with the context? According to the documentation "Tensor arguments that track history (i.e., with requires_grad=True) will be converted to ones that don’t track history before the call, and their use will be registered in the graph", see https://pytorch.org/docs/stable/notes/extending.html. Is this the problem? Is there a way to work around it?
I am new to PyTorch and I wonder what I'm overlooking. Any help and explanation is appreciated.
I think I found an answer here: https://github.com/pytorch/pytorch/issues/8847 , i.e. I need to wrap the optimization in with torch.enable_grad():.
However, I still don't understand why it's necessary to convert the original Tensors to ones that don’t track history in forward().
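For what it's worth, here is a minimal sketch of that workaround (my reading of the linked issue, not a verified fix): the optimization loop from the question moves inside a torch.enable_grad() block in forward(). Note the loss is also reduced to a scalar so that backward() is well-defined:

import torch
from torch.autograd import Function

class MyFct(Function):
    @staticmethod
    def forward(ctx, input, weight, bias):
        # Arguments arrive with history tracking stripped, so grad mode
        # must be re-enabled for the inner optimization loop to work.
        with torch.enable_grad():
            y = torch.zeros(1, 1, requires_grad=True)  # initial guess
            optimizer = torch.optim.Adam([y], lr=1e-4)
            for t in range(5000):
                loss = ((input + weight - y) ** 2).sum()  # scalar loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return y.detach()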

Tensorflow: Don't Update if gradient is Nan

I have a deep model to train on CIFAR-10. Training works fine on the CPU. However, when I use GPU support, it causes gradients for some batches to be NaNs (I checked it using tf.check_numerics), and it happens randomly but early on. I believe the problem is related to my GPU.
My question is: is there a way to skip the update if at least one of the gradients contains NaNs, and force the model to proceed to the next batch?
Edit: Perhaps I should elaborate more on my problem.
This is how I apply the gradients:
with tf.control_dependencies([tf.check_numerics(grad, message='Gradient %s check failed, possible NaNs' % var.name) for grad, var in grads]):
    # Apply the gradients to adjust the shared variables.
    apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)
I have thought of using tf.check_numerics first to verify whether there are NaNs in the gradients, and then, if there are NaNs (i.e. the check failed), I can "pass" without using opt.apply_gradients. However, is there a way to catch an error with tf.control_dependencies?
I managed to figure it out, albeit not in the most elegant way.
My solution is as follows:
1) check all gradients first;
2) if the gradients are NaN-free, apply them;
3) otherwise, apply a fake update (with zero values), which requires a gradient override.
This is my code:
First, define a custom gradient:
@tf.RegisterGradient("ZeroGrad")
def _zero_grad(unused_op, grad):
    return tf.zeros_like(grad)
Then define an exception-handling function:
# This is added for the gradient check for NaNs.
def check_numerics_with_exception(grad, var):
    try:
        tf.check_numerics(grad, message='Gradient %s check failed, possible NaNs' % var.name)
    except:
        return tf.constant(False, shape=())
    else:
        return tf.constant(True, shape=())
Then create the conditional node:
num_nans_grads = tf.Variable(1.0, name='num_nans_grads')
check_all_numeric_op = tf.reduce_sum(tf.cast(tf.stack([tf.logical_not(check_numerics_with_exception(grad, var)) for grad, var in grads]), dtype=tf.float32))

with tf.control_dependencies([tf.assign(num_nans_grads, check_all_numeric_op)]):
    # Apply the gradients to adjust the shared variables.
    def fn_true_apply_grad(grads, global_step):
        apply_gradients_true = opt.apply_gradients(grads, global_step=global_step)
        return apply_gradients_true

    def fn_false_ignore_grad(grads, global_step):
        # print('batch update ignored due to nans, fake update is applied')
        g = tf.get_default_graph()
        with g.gradient_override_map({"Identity": "ZeroGrad"}):
            for (grad, var) in grads:
                tf.assign(var, tf.identity(var, name="Identity"))
            apply_gradients_false = opt.apply_gradients(grads, global_step=global_step)
        return apply_gradients_false

    apply_gradient_op = tf.cond(tf.equal(num_nans_grads, 0.), lambda: fn_true_apply_grad(grads, global_step), lambda: fn_false_ignore_grad(grads, global_step))
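As an aside, a lighter-weight variant of the same idea (my own sketch, not part of the answer above) is to zero out non-finite gradient elements with tf.where before applying them. This zeroes only the offending elements rather than skipping the whole batch, but it avoids the conditional and the gradient override:

safe_grads = [(tf.where(tf.is_finite(grad), grad, tf.zeros_like(grad)), var)
              for grad, var in grads]
# Apply the sanitized gradients unconditionally.
apply_gradient_op = opt.apply_gradients(safe_grads, global_step=global_step)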

Tensorflow - running total

How can I add the number 5 after every iteration of the loop?
I want to do something like this:
weight = 0.225
for i in range(10):
    weight += 5
    print(weight)
Here is how I am trying to do it in TensorFlow, but it never updates the weight:
import tensorflow as tf

weights = {
    'h0': tf.Variable(tf.random_normal([1]))
}

def dummy(x):
    weights['h0'] = tf.add(weights['h0'], 5)
    res = tf.add(weights['h0'], x)
    return res

# build computational graph
a = tf.placeholder('float', None)
d = dummy(a)

# initialize variables
init = tf.global_variables_initializer()

# create session and run the graph
with tf.Session() as sess:
    sess.run(init)
    for i in range(10):
        print(sess.run(d, feed_dict={a: [2]}))

# close session
sess.close()
There's an operation explicitly created for adding a value and assigning the result back to the input node: tf.assign_add.
You should use it instead of tf.assign + tf.add.
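For example, a minimal sketch of the fix using tf.assign_add, reusing the names from the question (the control dependency makes sure the read happens after the update):

import tensorflow as tf

weights = {'h0': tf.Variable(tf.random_normal([1]))}
a = tf.placeholder('float', None)

increment = tf.assign_add(weights['h0'], [5.0])  # graph-side update op
with tf.control_dependencies([increment]):
    d = tf.add(weights['h0'], a)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(10):
        print(sess.run(d, feed_dict={a: [2]}))  # increases by 5 each run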
More importantly, you should understand why your previous code doesn't work.
weights['h0'] = tf.add(weights['h0'], 5)
res = tf.add(weights['h0'], x)
On the first line, you define an add node whose inputs are weights['h0'] and 5, and you assign this node to the Python variable weights['h0'].
Now, therefore, weights['h0'] is a Python variable holding a TensorFlow node.
On the next line, you define another add node, between the previous node and x, and you return this node.
When the graph is evaluated, you evaluate the node pointed to by res, which forces the evaluation of the previous node (because res is a function of the node held by weights['h0']).
The problem is that your assignment on the first line is a Python assignment, not a TensorFlow assignment.
That assignment is executed only in the Python environment; it does not define an assign node in the TensorFlow graph.
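To put the distinction side by side, a two-line illustrative sketch of my own:

v = tf.Variable([1.0])
v_plus_5 = tf.add(v, 5)             # Python name binding only: a new tensor; v is untouched
update_v = tf.assign_add(v, [5.0])  # graph op: actually writes to v each time it is run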
P.S.: when you use with, you're defining a context manager that handles the closing operations for you. You can thus remove sess.close(), because it is executed automatically when you exit that context.
Apparently there is an assign operation:
https://www.tensorflow.org/api_docs/python/tf/assign
weights['h0'] = tf.assign(weights['h0'], tf.add(weights['h0'], 5))
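Note that the tensor returned by tf.assign only performs the update when it is evaluated, so this works because res is then built on top of it. A sketch of the question's dummy function with that line in place (my own illustration):

def dummy(x):
    # The assign's output tensor carries out the write when evaluated,
    # so every sess.run(d) increments the variable before the add.
    updated = tf.assign(weights['h0'], tf.add(weights['h0'], 5))
    return tf.add(updated, x)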

Reevaluate dependencies of a while loop

I am trying to understand how while loops work in TensorFlow. In particular, I have a variable, say x, that I update in the while loop, and I have some values that depend on x; but when running the while loop, those values do not seem to be updated as x changes.
The following code, in which I have tried to implement a simple gradient descent optimizer, might illustrate what I mean:
import tensorflow as tf

x = tf.Variable(initial_value=4, dtype=tf.float32, trainable=False)
y = tf.multiply(x, x)
grad = tf.gradients(y, x)

def update_g():
    with tf.control_dependencies(grad):
        return tf.identity(grad[0])

iterations = tf.placeholder(tf.int32)
i = tf.constant(0, dtype=tf.int32)
g = tf.Variable(initial_value=grad[0], dtype=tf.float32, trainable=False)

c = lambda i_loop, x_loop, g_loop: i_loop < iterations
b = lambda i_loop, x_loop, g_loop: [i_loop + 1, tf.assign(x, x_loop - 10*g_loop), update_g()]
l = tf.while_loop(c, b, [i, x, g], back_prop=False, parallel_iterations=1)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    res_g = sess.run(grad)
    res_l = sess.run(l, feed_dict={iterations: 10})
    res_x = sess.run(x)
    print(res_g)
    print(res_l)
    print(res_x)
Running this on TensorFlow 1.0 gives this result for me:
[8.0]
[10, -796.0, 8.0]
-796.0
and the issue is that the value of the gradient is not updated as x changes.
I have tried various variations on the above code, but cannot seem to find a version that works. Basically, my question is whether the above can be made to work, or whether I need to rethink the approach.
(Maybe I should add that I am not really interested in writing a gradient descent optimizer; I just built this to have something simple and understandable to work with.)
With some help from the other answer I managed to get this working. Posting the complete code here as a second answer:
x = tf.constant(4, dtype=tf.float32)
y = tf.multiply(x, x)
grad = tf.gradients(y, x)

def loop_grad(x_loop):
    y2 = tf.multiply(x_loop, x_loop)
    return tf.gradients(y2, x_loop)[0]

iterations = tf.placeholder(tf.int32)
i = tf.constant(0, dtype=tf.int32)

c = lambda i_loop, x_loop: i_loop < iterations
b = lambda i_loop, x_loop: [i_loop + 1, x_loop - 0.1*loop_grad(x_loop)]
l = tf.while_loop(c, b, [i, x], back_prop=False, parallel_iterations=1)

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.05)
with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
    sess.run(tf.global_variables_initializer())
    res_g = sess.run(grad)
    res_l = sess.run(l, feed_dict={iterations: 100000})
    res_x = sess.run(x)
    print(res_g)
    print(res_l)
    print(res_x)
Changing the learning rate from the code in the question and increasing the number of iterations gives the output:
[8.0]
[100000, 5.1315068e-38]
4.0
This seems to be working. It runs reasonably fast even with a high iteration count, so there does not seem to be anything really horrible going on with updating the graph in each iteration of the while loop; fear of that was probably one reason I didn't opt for this approach from the start.
Having tf.Variable objects as loop variables for while loops is not supported, and will behave in weird nondeterministic ways. Always use tf.assign and friends to update the value of a tf.Variable.
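To illustrate that advice on the question's example, here is a sketch of my own (an assumption-laden illustration, not taken from either answer): the loop variables are plain tensors, and the result is written back to the variable with a single tf.assign afterwards:

import tensorflow as tf

x = tf.Variable(4.0, trainable=False)

def loop_grad(x_loop):
    # Rebuild the dependent value from the loop variable each iteration.
    return tf.gradients(tf.multiply(x_loop, x_loop), x_loop)[0]

iterations = tf.placeholder(tf.int32)
c = lambda i, x_loop: i < iterations
b = lambda i, x_loop: [i + 1, x_loop - 0.1 * loop_grad(x_loop)]
_, final_x = tf.while_loop(c, b, [tf.constant(0), tf.identity(x)],
                           back_prop=False, parallel_iterations=1)
update_x = tf.assign(x, final_x)  # write the loop result back to the variable once

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(update_x, feed_dict={iterations: 100})
    print(sess.run(x))  # approaches 0.0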

Tensorflow: LSTM with moving average of weights of another LSTM

I would like to have an LSTM in TensorFlow whose weights are the exponential moving average of the weights of another LSTM. So basically I have this code, with some input placeholder and some initial-state placeholder:
def test_lstm(input_ph, init_ph, scope):
    cell = tf.nn.rnn_cell.LSTMCell(128, use_peepholes=True)
    input_ph = tf.transpose(input_ph, [1, 0, 2])
    input_list = tf.unpack(input_ph)
    with tf.variable_scope(scope) as vs:
        outputs, states = tf.nn.rnn(cell, input_list, initial_state=init_ph)
        theta = [v for v in tf.all_variables() if v.name.startswith(vs.name)]
    return outputs, theta

lstm_1, theta_1 = test_lstm(input_1, state_init_1, scope="lstm_1")
What I would like to do now is something along these lines (which doesn't actually work, because the exponential moving average appends the tag "ema" to the variable names of the weights, and they do not appear in the variable scope because they were not created with tf.get_variable):
ema = tf.train.ExponentialMovingAverage(decay=1-self.tau, name="ema")
with tf.variable_scope("lstm_2"):
    maintain_averages_theta_1 = ema.apply(theta_1)
    theta_2_1 = [ema.average(x) for x in theta_1]
    lstm_2, theta_2_2 = test_lstm(input_2, state_init_2, scope="lstm_2")
where eventually theta_2_1 would be equal to theta_2_2 (or throw an exception because the variables already exist).
Change (3/May/2018): I added one line on top of the old answer. The old answer is sufficient on its own for this question, but in real practice we initialize the 'target network' to the same values as the 'behavioural network'. I added this point; search for the line marked 'Change0'. You need to do this only once, right after the target network's parameters are initialized. Without 'Change0', your first round of gradient-based updates would misbehave, since Q_target and Q_behaviour are not correlated while you need them to be somewhat related, e.g. TD = r + gamma*Q_target - Q_behaviour for the update.
Seems a bit late, but hope this helps.
The critical problem with the TF RNN series is that we cannot directly designate the variables for RNNs, unlike plain feedforward or convolutional NNs, so we can't do the simple thing: get the EMA'd variables and plug them into the network.
Let's get into the real deal (I attached a practice code to look at, so please refer to EMA_LSTM.py).
Shall we say there is a network function containing an LSTM:
def network(in_x, in_h):
    # Input: 'in_x' is the input sequence, 'in_h' is the initial hidden cell (c, m).
    # Output: 'hidden_outputs' is the output sequence (c-sequence), 'net' is the list of parameters used in this network function.
    cell = tf.nn.rnn_cell.BasicLSTMCell(3, state_is_tuple=True)
    in_h = tf.nn.rnn_cell.LSTMStateTuple(in_h[0], in_h[1])
    hidden_outputs, last_tuple = tf.nn.dynamic_rnn(cell, in_x, dtype=tf.float32, initial_state=in_h)
    net = [v for v in tf.trainable_variables() if tf.contrib.framework.get_name_scope() in v.name]
    return hidden_outputs, net
Then you declare a tf.placeholder for each of the necessary inputs:
in_x = tf.placeholder("float", [None,None,6])
in_c = tf.placeholder("float", [None,3])
in_m = tf.placeholder("float", [None,3])
in_h = (in_c, in_m)
Lastly, we run a session that executes the network() function with the specified inputs:
init_cORm = np.zeros(shape=(1,3))
input = np.ones(shape=(1,1,6))

print '========================new-1(beh)=============================='
with tf.Session() as sess:
    with tf.variable_scope('beh', reuse=False) as beh:
        result, net1 = network(in_x, in_h)
        sess.run(tf.global_variables_initializer())
        list = sess.run([result, in_x, in_h, net1], feed_dict={
            in_x: input,
            in_c: init_cORm,
            in_m: init_cORm
        })
        print 'result:', list[0]
        print 'in_x:', list[1]
        print 'in_h:', list[2]
        print 'net1:', list[3][0][0][:4]
Now, we are going to make a var_list called 'net4' that contains the ExponentialMovingAverage-ed (EMA-ed) values of 'net1'. Below, 'net1' (first assigned in the beh session above) is newly assigned by adding 1. to each of its elements:
ema = tf.train.ExponentialMovingAverage(decay=0.5)
target_update_op = ema.apply(net1)
init_new_vars_op = tf.initialize_variables(var_list=[v for v in tf.global_variables() if 'ExponentialMovingAverage' in v.name])  # 'initialize_variables' will be replaced with 'variables_initializer' in 2017
sess.run(init_new_vars_op)
sess.run([param4.assign(param1.eval()) for param4, param1 in zip(net4, net1)])  # Change0

len_net1 = len(net1)
net1_ema = [[] for i in range(len_net1)]
for i in range(len_net1):
    sess.run(net1[i].assign(1. + net1[i]))
sess.run(target_update_op)
Note that:
1) We only initialised (by declaring 'init_new_vars_op' and then running that declaration job) the variables whose names contain 'ExponentialMovingAverage'; otherwise the variables in net1 would also be re-initialised.
2) 'net1' is newly assigned with +1 added to every element of its variables. If an element of 'net1' was -0.5 and is now 0.5 after the +1, then we want 'net4' to be 0. when the EMA decay rate is 0.5.
3) Finally, we run the EMA job with 'sess.run(target_update_op)'.
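As a quick sanity check of that arithmetic (tf.train.ExponentialMovingAverage maintains shadow = decay * shadow + (1 - decay) * value):

decay = 0.5
old_shadow, new_value = -0.5, 0.5   # the element before and after the +1.
shadow = decay * old_shadow + (1 - decay) * new_value
print(shadow)  # 0.0, the expected EMA'd value in net4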
Eventually, we declare 'net4' with the 'network()' function first and then assign and run the EMA(net1) values into 'net4'. When you run 'sess.run(result)', it will then be the one with the EMA(net1)-ed variables.
with tf.variable_scope('tare', reuse=False) as tare:
    result, net4 = network(in_x, in_h)

len_net4 = len(net4)
target_assign = [[] for i in range(len_net4)]
for i in range(len_net4):
    target_assign[i] = net4[i].assign(ema.average(net1[i]))
    sess.run(target_assign[i].op)

list = sess.run([result, in_x, in_h, net4], feed_dict={
    in_x: input,
    in_c: init_cORm,
    in_m: init_cORm
})
What happened here? You just indirectly declared the LSTM variables as 'net4' via the 'network()' function. Then, in the for loop, we specify that 'net4' is actually the EMA of 'net1' and 'net1 + 1.'. Lastly, with net4 told what to do (via 'network()') and what values to take (via the for loop of '.assign(ema.average())' onto itself), you run the process.
It is somewhat counter-intuitive that we declare 'result' first and specify the parameter values second. However, that's the nature of TF: it is always logical to set up the variables, the processes, and their relations first, then assign values, and then run the processes.
Lastly, a few things to take this further for real machine-learning code:
Here, I just assigned 'net1' a second time with 'net1 + 1.'. In a real case, this '+1.' step is where you 'sess.run()' your optimiser (after you declare it somewhere). So every time after you 'sess.run(optimiser)', 'sess.run(target_update_op)' and then 'sess.run(target_assign[i].op)' should follow to update your 'net4' along the EMA of 'net1'. Concretely, you can do this job in a different order, as below:
ema = tf.train.ExponentialMovingAverage(decay=0.5)
target_update_op = ema.apply(net1)

with tf.variable_scope('tare', reuse=False) as tare:
    result, net4 = network(in_x, in_h)

len_net4 = len(net4)
target_assign = [[] for i in range(len_net4)]
for i in range(len_net4):
    target_assign[i] = net4[i].assign(ema.average(net1[i]))

init_new_vars_op = tf.initialize_variables(var_list=[v for v in tf.global_variables() if 'ExponentialMovingAverage' in v.name])  # 'initialize_variables' will be replaced with 'variables_initializer' in 2017
sess.run(init_new_vars_op)
sess.run([param4.assign(param1.eval()) for param4, param1 in zip(net4, net1)])  # Change0
Lastly, be aware that you must sess.run(ema.apply(net)) right after net changes, and then sess.run(net_ema.assign(ema.average(net))). Without .apply, net_ema will not be assigned the averaged value:
len_net1 = len(net1)
net1_ema = [[] for i in range(len_net1)]
for i in range(len_net1):
    sess.run(net1[i].assign(1. + net1[i]))
sess.run(target_update_op)

for i in range(len_net4):
    sess.run(target_assign[i].op)

list = sess.run([result, in_x, in_h, net4], feed_dict={
    in_x: input,
    in_c: init_cORm,
    in_m: init_cORm
})