Why AdamOptimizer fails to find optimal value to minimize x*x? - tensorflow

I am trying to minimize x*x with the Adam optimizer. I expect to get x = 0 as the result, but the value of x stays close to its initial value.
import tensorflow as tf

x = tf.Variable(-2.)
sq = x * x
o = tf.train.AdamOptimizer(1e-1).minimize(sq)
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run([init])
    sess.run([o])
    r = sess.run([x])
    print("done", r)
I get -1.9 as the result instead of the expected 0.
Do I understand correctly that -2 is the initial value here, or is it something else? Does AdamOptimizer perform just one step, or is it possible to run it for continuous optimization? How do I get x = 0 as the result?

sess.run([o]) runs only a single optimization step. To perform a full optimization, you need to run many steps, which can be done by repeating the single step in a loop.
Thus, you can replace sess.run([o]) with:
for i in range(1000):
    sess.run([o])
This yields the result 3.4735016e-23, very close to the expected 0.
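For reference, here is the full corrected script as a hedged sketch (TF 1.x graph mode assumed; it simply combines your original code with the loop above):
import tensorflow as tf

x = tf.Variable(-2.)
sq = x * x
o = tf.train.AdamOptimizer(1e-1).minimize(sq)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(1000):          # run many optimization steps, not just one
        sess.run(o)
    print("done", sess.run(x))     # prints a value very close to 0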
In my experience, people usually run many optimization steps just as I demonstrated, with a for loop. If you are interested in implementing the loop as a TensorFlow operation, and then running this operation only once, this can be done, but it is not recommended. The reasons are: (a) I don't think you will gain any "elegance" in your code by doing this. (b) If you want to run 1000 steps, you will need to add 1000 sets of operations to your graph, and group them as one. Contrast this to needing only one set of operations.
You can see more relevant information in this question.

Related

Tensorflow error: The graph couldn't be sorted in topological order

When I run my loss function, this error (or warning) occurs.
I really cannot figure out what causes it.
My guess is that maybe I didn't use the original input, for example:
def loss(predict, label):
    # for some reason I need to extract some values from predict
    predictProcessed = process(predict)
    # predictProcessed is a subset of predict
    loss = tf.square(predict - label)
    return loss
Is my guess right or not?
I also use a double for-loop in this code. Should the code use fewer for-loops? Thanks.

Will using multiple minimizing ops at once work as expected in Tensorflow?

For example, if I do:
loss_one = something
loss_two = something_else
train_one = tf.train.AdamOptimizer(0.001).minimize(loss_one)
train_two = tf.train.AdamOptimizer(0.001).minimize(loss_two)
sess.run([train_one, train_two])
Will that do what's expected? The reason I'm concerned is that I don't know exactly how gradients are accumulated. Are they stored on the optimizers themselves, or on the variables? If it's the latter, I can imagine them interfering.
Most likely not. Presumably, both loss_one and loss_two are a measure of how close the output of your model, let's say out, is to what you expected, so they would both be a function of out and maybe something else. Both optimizers compute the variable updates from the out computed with the values that the variables had before calling session.run. So if you apply one update and then the other, the second update would not really be correct, because it was not computed using the now-updated variables. This may not be a huge issue in itself, though. A more complicated problem is that, depending on how exactly the optimizer is implemented, if it is something more or less like variable = variable + update, then it is not deterministic whether the variable on the right-hand side of the expression has the original or the first-updated value, so you could end up adding only one of the updates or both, non-deterministically.
There are several better alternatives:
Use only one optimizer at a time, so you call sess.run(train_one) first and sess.run(train_two) later.
Optimize the (possibly weighted) sum of both losses (tf.train.AdamOptimizer(0.001).minimize(loss_one + loss_two)).
Call compute_gradients from the optimizer for each loss value, combine the resulting gradients however you see fit (e.g. adding or averaging the updates), and apply them with apply_gradients (see the sketch after this list).
Use tf.control_dependencies to make sure that one optimization step always takes place after the other. However, this means that using the second optimizer always requires using the first one (this could be worked around, maybe with tf.cond, but it's more of a hassle).
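Here is a hedged sketch of the third alternative (combining the gradients of both losses). It assumes TF 1.x, that both losses share the same trainable variables, that the two (gradient, variable) lists pair up the variables in the same order, and that no gradient is None:
opt = tf.train.AdamOptimizer(0.001)
gv_one = opt.compute_gradients(loss_one)   # list of (gradient, variable) pairs
gv_two = opt.compute_gradients(loss_two)
# Sum the gradients variable by variable, keeping the variable from the first list.
combined = [(g1 + g2, v1) for (g1, v1), (g2, v2) in zip(gv_one, gv_two)]
train_op = opt.apply_gradients(combined)
sess.run(train_op)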
The optimizer is mainly in charge of calculating the gradients (backpropagation). If you give it the loss twice (i.e. run it two times, as you are doing), it will apply two updates while performing inference only once. I'm not sure why you would do that, though.

Cache intermediate tensor and update periodically

I have a large tensor that is expensive to calculate, but realistically I only need to recalculate it every 10 iterations or so (during gradient descent). What's the best way to do this?
More specifically:
Suppose I have an intermediate_tensor that is used in the calculation of final_tensor each time a tf.Session is run. final_tensor is, in my case, a set of modified gradients to use in optimization. It is possible to define a graph that contains both intermediate_tensor and final_tensor. However, running this graph will be inefficient when intermediate_tensor changes slowly. In pseudocode, this is what I'd like to do:
intermediate_tensor = tf.some_operation(earlier_variable)
final_tensor = tf.matmul(intermediate_tensor, other_earlier_variable)

with tf.Session() as sess:
    # pretending `partial_run` works like I want it to:
    sess.partial_run(intermediate_tensor, feed_dict={})
    for i in range(5):
        ft = sess.partial_run(final_tensor, feed_dict={})
        print(ft)
The experimental partial_run feature is almost what I'm looking for. However, partial_run can only be used if I want to evaluate final_tensor just once for each time I evaluate intermediate_tensor. It won't work in a for loop.
My workaround for the moment is to use tf.placeholder. I evaluate intermediate_tensor in one call to sess.run, then feed the result to a new call of sess.run as a placeholder. However, this is very inflexible. It requires that I hardcode the variable shape at compile time, for example. It's also not very good when the number of intermediate variables I'd like to use is very large.
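For concreteness, here is a hedged sketch of that placeholder workaround (TF 1.x graph mode; the shapes are illustrative, and tf.nn.relu stands in for the expensive operation):
import tensorflow as tf

earlier_variable = tf.Variable(tf.random_normal([4, 8]))
other_earlier_variable = tf.Variable(tf.random_normal([8, 3]))

intermediate_tensor = tf.nn.relu(earlier_variable)                  # expensive to compute
intermediate_ph = tf.placeholder(tf.float32, shape=[4, 8])          # shape must be hardcoded
final_tensor = tf.matmul(intermediate_ph, other_earlier_variable)   # cheap, uses the cached value

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    cached = sess.run(intermediate_tensor)   # recompute only every ~10 iterations
    for i in range(5):
        ft = sess.run(final_tensor, feed_dict={intermediate_ph: cached})
        print(ft)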
Is there a better way? This would be very helpful if, say, one were using a curvature matrix that doesn't need to be evaluated every iteration.

Force copy of tensor when enqueuing

First, I'm not sure if the title is very good, but it was the best I could come up with given my understanding of the situation.
The background is that I'm trying to understand how queues work in tensorflow and ran into the following issue which puzzled me.
I have a variable n, which I enqueue to a tf.FIFOQueue, and then I increment the variable. This is repeated several times, and one would expect a result similar to 0, 1, 2, ... However, when emptying the queue all values are the same.
More precisely, the code is as follows:
from __future__ import print_function
import tensorflow as tf
q = tf.FIFOQueue(10, tf.float32)
n = tf.Variable(0, trainable=False, dtype=tf.float32)
inc = n.assign(n+1)
enqueue = q.enqueue(n)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
sess.run(enqueue)
sess.run(inc)
sess.run(enqueue)
sess.run(inc)
sess.run(enqueue)
sess.run(inc)
print(sess.run(q.dequeue()))
print(sess.run(q.dequeue()))
print(sess.run(q.dequeue()))
Which I expect would print:
0.0
1.0
2.0
Instead I get the following result:
3.0
3.0
3.0
It seems like I'm pushing some pointer to n to the queue, instead of the actual value, which is what I want. However, I don't really have any actual understanding of tensorflow internals, so maybe something else is going on?
I tried changing
enqueue = q.enqueue(n)
to
enqueue = q.enqueue(tf.identity(n))
since the answers to How can I copy a variable in tensorflow and In TensorFlow, what is tf.identity used for? give me the impression that it might help, but it does not change the result. I also tried adding tf.control_dependencies(), but again, all values are the same when dequeuing.
Edit: The output above is from running the code on a computer with a single CPU. When trying to see if there was some difference between different versions of tensorflow, I noticed that if I run the code on a computer with a CPU and a GPU I get the "expected" result. Indeed, if I run with CUDA_VISIBLE_DEVICES="" I get the result above, and with CUDA_VISIBLE_DEVICES="0" I get the "expected" result.
To force a non-caching read you can do
q.enqueue(tf.add(n, 0))
This is what's currently done by the batch-normalization layer to force a copy.
Semantics of how variables get read vs. referenced are in the process of being revamped, so they are temporarily non-intuitive. In particular, I expected q.enqueue(n.read_value()) to force a non-caching read, but it doesn't fix your example on TF 0.12rc1.
Using a GPU machine puts the variable on the GPU, while the Queue is CPU-only, so the enqueue op forces a GPU->CPU copy.
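Putting this together, a hedged sketch of the question's script with the non-caching read applied might look like this (TF 1.x, run on CPU, where the original problem appears):
import tensorflow as tf

q = tf.FIFOQueue(10, tf.float32)
n = tf.Variable(0, trainable=False, dtype=tf.float32)
inc = n.assign(n + 1)
enqueue = q.enqueue(tf.add(n, 0))   # enqueue a fresh copy of n's value

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for _ in range(3):
    sess.run(enqueue)
    sess.run(inc)
for _ in range(3):
    print(sess.run(q.dequeue()))    # should print 0.0, 1.0, 2.0 if the copy is forced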
In case it helps, I've found that the other answers, although correct, do not work for all dtypes.
For example, this works fine with floats or ints but fails when n is a string tensor:
q.enqueue(tf.add(n, 0))
This one fails when the queue uses tuples with heterogeneous types (e.g., ints and floats):
q.enqueue_many([[n]])
So, if you see yourself caught in any of these situations try this instead:
q.enqueue(tf.add(n, tf.zeros_like(n)))
Or, to enqueue a tuple t:
q.enqueue([tf.add(x, tf.zeros_like(x)) for x in t])
That works even for string tensors and heterogeneous tuple types.
Hope it helps!
--
Update: it looks like tf.bool types do not work with tf.zeros_like(). For those, an explicit cast to an integer type might be needed.
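For example, a hedged sketch of that cast-based workaround (b here is just an illustrative bool tensor):
b = tf.constant([True, False])                                    # illustrative bool tensor
b_int = tf.cast(b, tf.int32)                                      # cast to an integer type first
b_copy = tf.cast(tf.add(b_int, tf.zeros_like(b_int)), tf.bool)    # copy, then cast back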

Tensorflow: opt.compute_gradients() returns values different from the weight difference of opt.apply_gradients()

Question: What is the most efficient way to get the delta of my weights in a TensorFlow network?
Background: I've got the operators hooked up as follows (thanks to this SO question):
self.cost = `the rest of the network`
self.rmsprop = tf.train.RMSPropOptimizer(lr,rms_decay,0.0,rms_eps)
self.comp_grads = self.rmsprop.compute_gradients(self.cost)
self.grad_placeholder = [(tf.placeholder("float", shape=grad[1].get_shape(), name="grad_placeholder"), grad[1]) for grad in self.comp_grads]
self.apply_grads = self.rmsprop.apply_gradients(self.grad_placeholder)
Now, to feed in information, I run the following:
feed_dict = `training variables`
grad_vals = self.sess.run([grad[0] for grad in self.comp_grads], feed_dict=feed_dict)
feed_dict2 = `feed_dict plus gradient values added to self.grad_placeholder`
self.sess.run(self.apply_grads, feed_dict=feed_dict2)
Running self.apply_grads updates the network weights, but when I compute the difference between the starting and ending weights (via run(self.w1)), those numbers are different from what is stored in grad_vals[0]. I figure this is because the RMSPropOptimizer does more to the raw gradients, but I'm not sure what, or where to find out what it does.
So back to the question: how do I get the delta of my weights in the most efficient way? Am I stuck running self.w1.eval(sess) multiple times to get the weights and computing the difference? Is there something I'm missing about the tf.RMSPropOptimizer function?
Thanks!
RMSProp does not simply subtract the gradient from the parameters; it uses a more complicated formula involving a combination of (see the sketch after this list):
a momentum term, if the corresponding parameter is not 0;
a gradient step, rescaled non-uniformly (per coordinate) by the square root of a running average of the squared gradient.
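To make this concrete, here is a hedged, plain-Python sketch of the RMSProp update for a single scalar weight (the numbers are made up; TensorFlow's actual implementation lives in the C++ kernel):
lr, decay, momentum, eps = 0.01, 0.9, 0.0, 1e-10
w, ms, mom = 0.5, 0.0, 0.0    # weight, mean-square slot, momentum slot
grad = 0.2                     # raw gradient, as returned by compute_gradients

ms = decay * ms + (1 - decay) * grad ** 2               # running average of the squared gradient
mom = momentum * mom + lr * grad / (ms + eps) ** 0.5    # the delta actually applied ('momentum' slot)
w = w - mom                                             # note: mom is not the raw gradient
print(mom, w)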
For more information you can refer to these slides or this recent paper.
The delta is first computed in memory by TensorFlow in the slot variable 'momentum', and then the variable is updated (see the C++ operator).
Thus, you should be able to access it and construct a delta node with delta_w1 = self.rmsprop.get_slot(self.w1, 'momentum'). (I have not tried it yet.)
You can add the weights to the list of things to fetch on each run call, and then compute the deltas outside of TensorFlow, since you will have the iterates. This should be reasonably efficient, although it incurs an extra elementwise difference; to avoid that you would have to hack around in the guts of the optimizer, find where it puts the update before applying it, and fetch that each step. Fetching the weights on each call shouldn't cause wasteful extra evaluations of parts of the graph, at least.
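As a concrete illustration of this approach, here is a hedged, self-contained sketch with a toy model (TF 1.x; the variable and hyperparameters are purely illustrative):
import tensorflow as tf

w = tf.Variable([1.0, 2.0], dtype=tf.float32)
cost = tf.reduce_sum(tf.square(w))
train_op = tf.train.RMSPropOptimizer(0.01, 0.9, 0.0, 1e-10).minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    w_before = sess.run(w)          # fetch the weights before the step
    sess.run(train_op)              # apply one RMSProp update
    w_after = sess.run(w)           # fetch the weights after the step
    delta = w_after - w_before      # the actual applied update, not the raw gradient
    print(delta)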
RMSProp does complicated scaling of the learning rate for each weight. Basically it divides the learning rate for a weight by a running average of the magnitudes of recent gradients of that weight.