Reevaluate dependencies of a while loop - tensorflow

I am trying to understand how while loops work in tensorflow. In particular, I have a variable, x say, that I update in the while loop, and some values that depend on x, but when I run the while loop those values do not seem to be updated when x changes.
The following code, where I have tried to implement a simple gradient descent optimizer, might illustrate what I mean:
import tensorflow as tf
x = tf.Variable(initial_value=4, dtype=tf.float32, trainable=False)
y = tf.multiply(x,x)
grad = tf.gradients(y, x)
def update_g():
    with tf.control_dependencies(grad):
        return tf.identity(grad[0])
iterations = tf.placeholder(tf.int32)
i = tf.constant(0, dtype=tf.int32)
g = tf.Variable(initial_value=grad[0], dtype=tf.float32, trainable=False)
c = lambda i_loop, x_loop, g_loop: i_loop < iterations
b = lambda i_loop, x_loop, g_loop: [i_loop+1, tf.assign(x, x_loop - 10*g_loop), update_g()]
l = tf.while_loop(c, b, [i, x, g], back_prop=False, parallel_iterations=1)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    res_g = sess.run(grad)
    res_l = sess.run(l, feed_dict={iterations: 10})
    res_x = sess.run(x)
    print(res_g)
    print(res_l)
    print(res_x)
Running this on tensorflow 1.0 gives this result for me:
[8.0]
[10, -796.0, 8.0]
-796.0
and the issue is that the value of the gradient is not updated as x changes.
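The value -796.0 is exactly what ten steps produce if the gradient stays frozen at its initial value of 8.0 (4 - 10 * 10 * 8). A minimal plain-Python sketch of that arithmetic, with no TensorFlow involved and a function name (`run_with_stale_gradient`) made up for this illustration:

```python
def run_with_stale_gradient(x0, learning_rate, iterations):
    """Apply gradient steps for y = x*x, but never recompute the gradient."""
    stale_grad = 2.0 * x0  # dy/dx evaluated once, at the starting point
    x = x0
    for _ in range(iterations):
        x = x - learning_rate * stale_grad  # same gradient on every pass
    return x, stale_grad

final_x, final_grad = run_with_stale_gradient(4.0, 10.0, 10)
print(final_x, final_grad)  # -796.0 8.0
```

This only mirrors the numbers in the output above; it does not model how TensorFlow schedules the ops.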
I have tried several variations on the above code, but cannot seem to find a version that works. Basically my question is whether the above can be made to work, or whether I need to rethink the approach.
(Maybe I should add that I am not interested in writing a gradient descent optimizer; I just built this to have something simple and understandable to work with.)

With some help from the other answer I managed to get this working. Posting the complete code here as a second answer:
x = tf.constant(4, dtype=tf.float32)
y = tf.multiply(x,x)
grad = tf.gradients(y, x)
def loop_grad(x_loop):
    y2 = tf.multiply(x_loop, x_loop)
    return tf.gradients(y2, x_loop)[0]
iterations = tf.placeholder(tf.int32)
i = tf.constant(0, dtype=tf.int32)
c = lambda i_loop, x_loop: i_loop < iterations
b = lambda i_loop, x_loop: [i_loop+1, x_loop - 0.1*loop_grad(x_loop)]
l = tf.while_loop(c, b, [i, x], back_prop=False, parallel_iterations=1)
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.05)
with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
    sess.run(tf.global_variables_initializer())
    res_g = sess.run(grad)
    res_l = sess.run(l, feed_dict={iterations: 100000})
    res_x = sess.run(x)
    print(res_g)
    print(res_l)
    print(res_x)
Changing the learning rate from the one in the question to 0.1 and increasing the number of iterations gives the output:
[8.0]
[100000, 5.1315068e-38]
4.0
Which seems to be working. It runs reasonably fast even with high iteration count, so there does not seem to be something really horrible going on with updating the graph in each iteration of the while loop, a fear of which probably was one reason why I didn't opt for this approach from the start.
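For intuition, the arithmetic of the working loop can be mirrored in plain Python: the gradient is recomputed from the loop-carried value on every pass, and the starting value is never modified. This is only a sketch of the numbers, not of how TensorFlow executes the loop, and `descend` is a name made up here:

```python
def descend(x0, learning_rate, iterations):
    """Gradient descent on y = x*x with the gradient recomputed every step."""
    x = x0
    for _ in range(iterations):
        grad = 2.0 * x                 # gradient at the current loop value
        x = x - learning_rate * grad   # with learning_rate 0.1 this scales x by 0.8
    return x

x_start = 4.0
x_final = descend(x_start, 0.1, 10)
print(x_final)  # close to 4 * 0.8**10
print(x_start)  # the starting value is untouched, like the tf.constant above
```

That the starting value stays at 4.0 matches the last line of the output above, where x is a constant rather than a variable.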

Having tf.Variable objects as loop variables for while loops is not supported, and will behave in weird nondeterministic ways. Always use tf.assign and friends to update the value of a tf.Variable.

How to use an optimizer within a forward pass in PyTorch

I want to use an optimizer within the forward pass of a custom defined Function, but it doesn't work. My code is as follows:
import torch
from torch.autograd import Function
class MyFct(Function):
    @staticmethod
    def forward(ctx, *args):
        input, weight, bias = args[0], args[1], args[2]
        y = torch.tensor([[0]], dtype=torch.float, requires_grad=True) #initial guess
        loss_fn = lambda y_star: (input + weight - y_star)**2
        learning_rate = 1e-4
        optimizer = torch.optim.Adam([y], lr=learning_rate)
        for t in range(5000):
            y_star = y
            print(y_star)
            loss = loss_fn(y_star)
            if t % 100 == 99:
                print(t, loss.item())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return y_star
And these are my test inputs:
x = torch.tensor([[2]], dtype=torch.float, requires_grad=True)
w = torch.tensor([[2]], dtype=torch.float, requires_grad=True)
y = torch.tensor([[6]], dtype=torch.float)
fct= MyFct.apply
y_hat = fct(x, w, None)
I always get the RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn.
Also, I've tested the optimization outside of the forward and it works, so I guess it's something with the context? According to the documentation "Tensor arguments that track history (i.e., with requires_grad=True) will be converted to ones that don’t track history before the call, and their use will be registered in the graph", see https://pytorch.org/docs/stable/notes/extending.html. Is this the problem? Is there a way to work around it?
I am new to PyTorch and I wonder what I'm overlooking. Any help and explanation is appreciated.
I think I found an answer here: https://github.com/pytorch/pytorch/issues/8847 , i.e. I need to wrap the optimization in a with torch.enable_grad(): block.
However, I still don't understand why it's necessary to convert the original Tensors to ones that don’t track history in forward().

RNN Slow-down phenomenon of Tensorflow

I found a peculiar behavior of the LSTM cell in tensorflow (probably not limited to LSTM, but that is all I examined) which, as far as I know, has not been reported.
I am not sure whether it actually has been, so I am posting it here on SO. Below is toy code that reproduces the problem:
import tensorflow as tf
import numpy as np
import time
def network(input_list):
    input,init_hidden_c,init_hidden_m = input_list
    cell = tf.nn.rnn_cell.BasicLSTMCell(256, state_is_tuple=True)
    init_hidden = tf.nn.rnn_cell.LSTMStateTuple(init_hidden_c, init_hidden_m)
    states, hidden_cm = tf.nn.dynamic_rnn(cell, input, dtype=tf.float32, initial_state=init_hidden)
    net = [v for v in tf.trainable_variables()]
    return states, hidden_cm, net
def action(x, h_c, h_m):
    t0 = time.time()
    outputs, output_h = sess.run([rnn_states[:,-1:,:], rnn_hidden_cm], feed_dict={
        rnn_input:x,
        rnn_init_hidden_c: h_c,
        rnn_init_hidden_m: h_m
    })
    dt = time.time() - t0
    return outputs, output_h, dt
rnn_input = tf.placeholder("float", [None, None, 512])
rnn_init_hidden_c = tf.placeholder("float", [None,256])
rnn_init_hidden_m = tf.placeholder("float", [None,256])
rnn_input_list = [rnn_input, rnn_init_hidden_c, rnn_init_hidden_m]
rnn_states, rnn_hidden_cm, rnn_net = network(rnn_input_list)
feed_input = np.random.uniform(low=-1.,high=1.,size=(1,1,512))
feed_init_hidden_c = np.zeros(shape=(1,256))
feed_init_hidden_m = np.zeros(shape=(1,256))
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(10000):
    _, output_hidden_cm, deltat = action(feed_input, feed_init_hidden_c, feed_init_hidden_m)
    if i % 10 == 0:
        print('Running time: ' + str(deltat))
    (feed_init_hidden_c, feed_init_hidden_m) = output_hidden_cm
    feed_input = np.random.uniform(low=-1.,high=1.,size=(1,1,512))
[Not important] This code generates an output from the 'network()' function, which contains an LSTM. The input's temporal dimension is 1, so the output's is also 1, and the state is passed out of and back into the network at each step.
[Important] Look at the 'sess.run()' part. For reasons specific to my real code, I had put [:,-1:,:] on 'rnn_states'. With that in place, the time spent in each 'sess.run()' call keeps increasing. After some inspection of my own, I found that the slowdown stems from that [:,-1:,:]; I just wanted the output at the last time step. If you instead call 'outputs, output_h = sess.run([rnn_states, rnn_hidden_cm], feed_dict{~' without the [:,-1:,:] and take 'last_output = outputs[:,-1:,:]' after 'sess.run()', the slowdown does not occur.
I do not know why running with [:,-1:,:] makes the time grow like this. Is this undocumented tensorflow behavior that slows things down (maybe it adds more nodes to the graph on its own)?
Thank you, and I hope this post saves other users from the same mistake.
I encountered the same problem, with TensorFlow slowing down for each iteration I ran it, and found this question while trying to debug it. Here's a short description of my situation and how I solved it for future reference. Hopefully it can point someone in the right direction and save them some time.
In my case the problem was mainly that I didn't make use of feed_dict to supply the network state when executing sess.run(). Instead I redeclared outputs, final_state and prediction every iteration. The answer at https://github.com/tensorflow/tensorflow/issues/1439#issuecomment-194405649 made me realize how stupid that was... I was constantly creating new graph nodes in every iteration, making it all slower and slower. The problematic code looked something like this:
# defining the network
lstm_layer = rnn.BasicLSTMCell(num_units, forget_bias=1)
outputs, final_state = rnn.static_rnn(lstm_layer, input, initial_state=rnn_state, dtype='float32')
prediction = tf.nn.softmax(tf.matmul(outputs[-1], out_weights)+out_bias)
for input_data in data_seq:
    # redeclaring, stupid stupid...
    outputs, final_state = rnn.static_rnn(lstm_layer, input, initial_state=rnn_state, dtype='float32')
    prediction = tf.nn.softmax(tf.matmul(outputs[-1], out_weights)+out_bias)
    p, rnn_state = sess.run((prediction, final_state), feed_dict={x: input_data})
The solution was of course to declare the nodes only once at the beginning, and to supply the new data with feed_dict. The code went from being rather slow (> 15 ms at the start) and getting slower with every iteration, to executing every iteration in around 1 ms. My new code looks something like this:
out_weights = tf.Variable(tf.random_normal([num_units, n_classes]), name="out_weights")
out_bias = tf.Variable(tf.random_normal([n_classes]), name="out_bias")
# placeholder for the network state
state_placeholder = tf.placeholder(tf.float32, [2, 1, num_units])
rnn_state = tf.nn.rnn_cell.LSTMStateTuple(state_placeholder[0], state_placeholder[1])
x = tf.placeholder('float', [None, 1, n_input])
input = tf.unstack(x, 1, 1)
# defining the network
lstm_layer = rnn.BasicLSTMCell(num_units, forget_bias=1)
outputs, final_state = rnn.static_rnn(lstm_layer, input, initial_state=rnn_state, dtype='float32')
prediction = tf.nn.softmax(tf.matmul(outputs[-1], out_weights)+out_bias)
# actual network state, which we input with feed_dict
_rnn_state = tf.nn.rnn_cell.LSTMStateTuple(np.zeros((1, num_units), dtype='float32'), np.zeros((1, num_units), dtype='float32'))
it = 0
for input_data in data_seq:
    encl_input = [[input_data]]
    p, _rnn_state = sess.run((prediction, final_state), feed_dict={x: encl_input, rnn_state: _rnn_state})
    print("{} - {}".format(it, p))
    it += 1
Moving the declaration out from the for loop also got rid of the problem which the OP sdr2002 had, doing a slice outputs[-1] in sess.run() inside the for loop.
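To see why this matters: in graph mode, every Python expression that builds an op (a slice such as rnn_states[:,-1:,:] included, as far as I understand it) adds a node to the graph each time it is evaluated. The sketch below is a toy model of that bookkeeping, not TensorFlow itself; `ToyGraph`, `slice_inside_loop` and `slice_once` are names invented for this illustration:

```python
class ToyGraph:
    """Stand-in for a computation graph: it only records op names."""
    def __init__(self):
        self.ops = []

    def add_op(self, name):
        self.ops.append(name)
        return name

def slice_inside_loop(graph, steps):
    """Build the slice op on every iteration, as in the slow version."""
    graph.add_op("rnn_output")
    for _ in range(steps):
        graph.add_op("slice")  # one more node per iteration
    return len(graph.ops)

def slice_once(graph, steps):
    """Build every op once, then only 'run' the existing nodes."""
    graph.add_op("rnn_output")
    graph.add_op("slice")
    for _ in range(steps):
        pass  # running existing nodes adds nothing to the graph
    return len(graph.ops)

grown = slice_inside_loop(ToyGraph(), 1000)
fixed = slice_once(ToyGraph(), 1000)
print(grown, fixed)  # 1001 2
```

In the toy model the first graph grows linearly with the number of iterations, which is consistent with each call getting a little slower than the last.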
As mentioned above, the fix in this case is to not slice the output inside 'sess.run()', and to slice the returned array afterwards instead:
def action(x, h_c, h_m):
    t0 = time.time()
    outputs, output_h = sess.run([rnn_states, rnn_hidden_cm], feed_dict={
        rnn_input:x,
        rnn_init_hidden_c: h_c,
        rnn_init_hidden_m: h_m
    })
    outputs = outputs[:,-1:,:]
    dt = time.time() - t0
    return outputs, output_h, dt

Updating variable values in tensorflow

I have a basic question about updating the values of tensors via the tensorflow python api.
Consider the code snippet:
x = tf.placeholder(shape=(None,10), ... )
y = tf.placeholder(shape=(None,), ... )
W = tf.Variable( randn(10,10), dtype=tf.float32 )
yhat = tf.matmul(x, W)
Now let's assume I want to implement some sort of algorithm that iteratively updates the value of W (e.g. some optimization algo). This will involve steps like:
for i in range(max_its):
    resid = yhat - y
    W = f(W , resid) # some update
The problem here is that W on the LHS is a new tensor, not the W that is used in yhat = tf.matmul(x, W)! That is, a new tensor is created and the value of W used in my "model" doesn't update.
Now one way around this would be
for i in range(max_its):
    resid = yhat - y
    W = f(W , resid) # some update
    yhat = tf.matmul( x, W)
which results in the creation of a new "model" for each iteration of my loop !
Is there a better way to implement this (in python) without creating a whole bunch of new models for each iteration of the loop - but instead updating the original tensor W "in-place" so to speak?
Variables have an assign method. Try: W.assign(f(W,resid))
@aarbelle's terse answer is correct; I'll expand it a bit in case someone needs more info. The last 2 lines below are used for updating W.
x = tf.placeholder(shape=(None,10), ... )
y = tf.placeholder(shape=(None,), ... )
W = tf.Variable(randn(10,10), dtype=tf.float32 )
yhat = tf.matmul(x, W)
...
for i in range(max_its):
    resid = yhat - y
    update = W.assign(f(W , resid)) # do not forget to initialize tf variables.
    # "update" above is just a tf op, you need to run the op to update W.
    sess.run(update)
Precisely, the answer should be sess.run(W.assign(f(W,resid))). Then use sess.run(W) to show the change.
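The difference between W = f(W, resid) and W.assign(...) is the ordinary Python difference between rebinding a name and updating the object that name refers to. A plain-Python analogy, where `ToyVariable` and `model` are hypothetical stand-ins and not part of the tensorflow api:

```python
class ToyVariable:
    """Mutable container standing in for a tf.Variable (illustration only)."""
    def __init__(self, value):
        self.value = value

    def assign(self, new_value):
        self.value = new_value  # in-place update; the object's identity is kept
        return self

def model(weight, x):
    """Stand-in for yhat = tf.matmul(x, W)."""
    return weight.value * x

W = ToyVariable(2.0)
model_weight = W  # the "model" keeps a reference to this object

# Rebinding: the name W now points at a new object; the model still sees 2.0.
W = ToyVariable(W.value + 1.0)
after_rebind = model(model_weight, 10.0)

# Assigning: update the object the model actually holds.
model_weight.assign(model_weight.value + 1.0)
after_assign = model(model_weight, 10.0)

print(after_rebind, after_assign)  # 20.0 30.0
```

The analogy stops at the point where tensorflow needs the assign op to be run in a session before the stored value changes.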

What does opt.apply_gradients() do in TensorFlow?

The documentation is not quite clear about this. I suppose the gradients one can obtain by opt.compute_gradients(E, [v]) contain ∂E/∂x = g(x) for each element x of the tensor that v stores. Does opt.apply_gradients(grads_and_vars) essentially execute x ← x - η·g(x), where η is the learning rate? That would imply that if I want to add a positive additive change p to the variable, I would need to change g(x) ← g(x) - (1/η)p, e.g. like this:
opt = tf.train.GradientDescentOptimizer(learning_rate=l)
grads_and_vars = opt.compute_gradients(loss, var_list)
for i, gv in enumerate(grads_and_vars):  # index named i so it does not shadow the learning rate l
    grads_and_vars[i] = (gv[0] - (1/l) * p, gv[1])
train_op = opt.apply_gradients(grads_and_vars)
Is there a better way to do this?
The update rule that the apply_gradients method actually applies depends on the specific optimizer. Take a look at the implementation of apply_gradients in the tf.train.Optimizer class here. It relies on the derived classes implementing the update rule in the methods _apply_dense and _apply_sparse. The update rule you are referring to is implemented by the GradientDescentOptimizer.
Regarding your desired positive additive update: If what you are calling opt is an instantiation of GradientDescentOptimizer, then you could indeed achieve what you want to do by
grads_and_vars = opt.compute_gradients(E, [v])
eta = opt._learning_rate
my_grads_and_vars = [(g-(1/eta)*p, v) for g, v in grads_and_vars]
opt.apply_gradients(my_grads_and_vars)
The more elegant way to do this is probably to write a new optimizer (inheriting from tf.train.Optimizer) that implements your desired update rule directly.
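A quick numeric check of the algebra behind the gradient modification, in plain Python with made-up values for x, g, η and p: subtracting p/η from the gradient before a plain gradient-descent step has the same effect as taking the unmodified step and then adding p. This assumes the plain update x ← x - η·g and does not apply to optimizers with other update rules:

```python
def sgd_step(x, grad, eta):
    """Plain gradient-descent update: x <- x - eta * grad."""
    return x - eta * grad

# Illustrative values only; none of them come from the question.
x, grad, eta, p = 1.0, 0.4, 0.5, 0.25

plain = sgd_step(x, grad, eta)               # ordinary step
shifted = sgd_step(x, grad - p / eta, eta)   # step with the modified gradient

# The modified step lands p above the ordinary one (up to rounding).
print(plain, shifted, shifted - plain)
```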
You can also use the eager execution API.
import tensorflow as tf
tf.enable_eager_execution()
tfe = tf.contrib.eager
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grad = tfe.implicit_gradients(loss)
optimizer.apply_gradients(grad(model_fn, val_list))
Here is a concrete example:
import numpy as np
import tensorflow as tf
tf.enable_eager_execution()
tfe = tf.contrib.eager
W = tfe.Variable(np.random.randn())
b = tfe.Variable(np.random.randn())
def linear_regression(inputs):
    return inputs * W + b
def MSE(model_fn, inputs, labels):
    # n_samples is the number of training examples (defined elsewhere)
    return tf.reduce_sum(tf.pow(model_fn(inputs) - labels, 2)) / (2 * n_samples)
optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.001)
grad = tfe.implicit_gradients(MSE)
optimizer.apply_gradients(grad(linear_regression, train_X, train_Y)) # train_X and train_Y are your input data and label

Interpreting Tensorflow/Tensorboard "subtraction" operation

The following is code adapted from a simple learning example, that I have bent out of shape to understand the Tensorboard graph visualizations:
import tensorflow as tf
import numpy as np
sess = tf.InteractiveSession()
# Create 10 phony x, y data points in NumPy, y = x * 0.1 + 0.3
x_data = np.random.rand(10).astype("float32")
y_data = x_data * 0.1 + 0.3
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0, name = "internal_W"), name = "external_W")
b = tf.Variable(2*tf.zeros([1], name = "internal_b"), name = "doubled_b")
y = (W * x_data + b)
l1 = (y - y_data)
l2 = (y_data - y )
writer = tf.train.SummaryWriter("/tmp/test1", sess.graph_def)
init = tf.initialize_all_variables()
# Launch the graph.
sess = tf.Session()
sess.run(init)
print(sess.run(y))
print('---')
print((y_data))
print('---')
print(sess.run(l1))
print('---')
print(sess.run(l2))
A sample output of the print statements is:
[ 0.84253538 0.31011301 0.11627766 0.35491142 0.65550905 0.1798114
0.13632762 0.02010157 0.42960873 0.04218956]
---
[ 0.39195824 0.33384719 0.31269109 0.33873668 0.37154531 0.31962547
0.31487945 0.302194 0.3468895 0.30460477]
---
[ 0.45057714 -0.02373418 -0.19641343 0.01617473 0.28396374 -0.13981406
-0.17855182 -0.28209242 0.08271924 -0.2624152 ]
---
[-0.45057714 0.02373418 0.19641343 -0.01617473 -0.28396374 0.13981406
0.17855182 0.28209242 -0.08271924 0.2624152 ]
Clearly, the subtractions are working properly-- the inputs to the subtraction are in different order, and yield different outputs. However, the graph visualization is:
Notice the "Sub" operators, which appear not to reverse the order of the operands as the code does. (Highlighting either operator yields no additional insight.) Am I missing something obvious, or do the node visualizations completely obscure order of operands?
After futzing around with this, my considered answer to my own question is, "Yes, this is working as intended." The inputs to the nodes show only what the inputs are, not any particular relationships to the operation or the node or themselves; indeed, if one added a variable to itself in an operation node, the input variable would show up only once.
This is not a design choice I would have made, but that does seem to be the intent.
I still encourage others who may have more insight to comment or fully answer.