get_weights gets slower with every iteration - tensorflow

I'm computing gradients from a private network and applying them to another master network. Then I'm copying the weights from the master to the private network (it sounds redundant but bear with me). The problem is that with every iteration get_weights becomes slower, and I even run out of memory.
def work(self, session):
    with session.as_default(), session.graph.as_default():
        self.private_net = ACNetwork()
        state = self.env.reset()
        while counter<TOTAL_TR_STEPS:
            action_index, action_vector = self.get_action(state)
            next_state, reward, done, info = self.env.step(action_index)
            ....# store the new data : reward, state etc...
            if done == True:
                # end of episode
                state = self.env.reset()
                a_grads, c_grads = self.private_net.get_gradients()
                self.master.update_from_gradients(a_grads, c_grads)
                self._update_worker_net() #this is the slow one
            !!!!!!
This is the function that uses get_weights.
def _update_worker_net(self):
    self.private_net.actor_t.set_weights(\
        self.master.actor_t.get_weights())
    self.private_net.critic.set_weights(\
        self.master.critic.get_weights())
    return
Looking around I found a post that suggested using
K.clear_session()
at the end of the while block (at the !!!!!! segment) because somehow new nodes are being added (?!) to the graph. But that only returned an error:
AssertionError: Do not use tf.reset_default_graph() to clear nested graphs. If you need a cleared graph, exit the nesting and create a new graph.
Is there a faster way to transfer weights? Is there a way to not add new nodes (if that is what is indeed happening?)

This would typically happen when you dynamically add new nodes to the graph. Example situation:
while True:
    grad_op = optimizer.get_gradients()
    session.run([grad_op])
Here, every call to get_gradients adds new operations to the graph. The operations it returns do not change no matter how many times you call it, so a single call is enough. The correct way to write it is:
grad_op = optimizer.get_gradients()
while True:
    session.run([grad_op])
Something like that is probably happening in your code. Make sure that you don't construct new operations inside your while loop.
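To make the difference concrete, here is a minimal sketch in plain Python. ToyGraph, leaky_loop and fixed_loop are hypothetical names invented for this illustration; they only mimic how a static graph accumulates nodes and are not TensorFlow APIs.

```python
class ToyGraph:
    """Minimal stand-in for a static computation graph (not TensorFlow)."""

    def __init__(self):
        self.nodes = []
        self.finalized = False

    def add_op(self, name):
        # Refuse to grow once the graph has been frozen.
        if self.finalized:
            raise RuntimeError("graph is finalized; cannot add %r" % name)
        self.nodes.append(name)
        return name

    def finalize(self):
        self.finalized = True


def leaky_loop(graph, steps):
    # Anti-pattern: a new op is created on every iteration.
    for _ in range(steps):
        op = graph.add_op("copy_weights")
        # session.run(op) would go here
    return len(graph.nodes)


def fixed_loop(graph, steps):
    # Build the op once, freeze the graph, then only run it.
    op = graph.add_op("copy_weights")
    graph.finalize()
    for _ in range(steps):
        pass  # session.run(op) would go here
    return len(graph.nodes)


leaky_count = leaky_loop(ToyGraph(), 1000)
fixed_count = fixed_loop(ToyGraph(), 1000)
print(leaky_count, fixed_count)
```

In real TensorFlow 1.x code, calling session.graph.finalize() before entering the training loop should have a similar effect: the graph becomes read-only, so any call that tries to add a node inside the loop raises an error immediately instead of slowing down silently.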

Related

How to assign values from one graph to another graph that has the same structure in tensorflow?

I'm trying to implement DQN in tensorflow. I have one target network and one training network that have the same structure. At the beginning of every 10000 training steps, I want to load the values from a checkpoint into both the target network and the training network, then apply stop_gradient to the target network. However, I tried the following approaches, and none of them worked:
1. Put the two networks in one graph. However, every time I load them, I don't know how to assign the values of the training network part to the target network part. (They are saved as different values, since one has stop_gradient applied.)
2. Define two graphs using tf.Graph() and run two sessions respectively. However, I can't load the checkpoint of one graph into the other, even though they have the same structure. After all, they are two different graphs.
Can anyone give me some advice? It would be much appreciated!
The typical approach would be to put everything in one graph, put your two networks in two name scopes, create a tf.assign op from each variable in one scope to its counterpart in the other, and then use tf.group to construct a final "copying" operation. Let's assume that the function create_net() builds a single network:
import tensorflow as tf

# Note: tf.name_scope does not prefix variables created with tf.get_variable;
# use tf.variable_scope instead if create_net() relies on tf.get_variable.
with tf.name_scope('main_network'):
    main_net = create_net()
with tf.name_scope('target_network'):
    target_network = create_net()

# GLOBAL_VARIABLES is the newer name for the older tf.GraphKeys.VARIABLES key.
main_variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='main_network')
target_variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='target_network')

# I am assuming get_collection returns variables in the same order, please double
# check this is actually happening
assign_ops = []
for main_var, target_var in zip(main_variables, target_variables):
    assign_ops.append(tf.assign(target_var, tf.identity(main_var)))

copy_operation = tf.group(*assign_ops)
Now executing copy_operation in session.run should copy your main network parameters to the target network. The above code should be considered pseudocode rather than something you can copy and paste.
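Since the answer flags variable ordering as something to double check, one way to avoid depending on it is to pair variables by their scope-relative names. The following is a minimal sketch in plain Python that works on name strings only; strip_scope and pair_by_name are hypothetical helpers, and the variable names such as dense/kernel:0 are made-up examples.

```python
def strip_scope(name, scope):
    """Drop a leading 'scope/' prefix from a variable name."""
    prefix = scope + "/"
    if not name.startswith(prefix):
        raise ValueError("%r is not under scope %r" % (name, scope))
    return name[len(prefix):]


def pair_by_name(main_names, target_names, main_scope, target_scope):
    """Return (main, target) name pairs matched on the scope-relative name."""
    targets = {strip_scope(n, target_scope): n for n in target_names}
    pairs = []
    for name in main_names:
        key = strip_scope(name, main_scope)
        if key not in targets:
            raise KeyError("no target variable matches %r" % name)
        pairs.append((name, targets.pop(key)))
    if targets:
        # Anything left over has no counterpart in the main network.
        raise KeyError("unmatched target variables: %r" % sorted(targets.values()))
    return pairs


# Hypothetical variable names, deliberately listed in a different order.
main = ["main_network/dense/kernel:0", "main_network/dense/bias:0"]
target = ["target_network/dense/bias:0", "target_network/dense/kernel:0"]
pairs = pair_by_name(main, target, "main_network", "target_network")
print(pairs)
```

With real variables, the same matching could be done on each variable's name attribute before building the assign ops, so that a mismatch raises an error instead of silently copying the wrong tensor.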

Feed data into a tf.contrib.data.Dataset like a queue

About the tf.contrib.data.Dataset (from TensorFlow 1.2, see here and here) usage:
The way the Dataset API expects to get data doesn't really fit how I usually get my data. In my case, I have a thread in which I receive data. I don't know in advance when the data will end, but I can see when it ends. Then I wait until I have processed all the buffers, and at that point I have finished one epoch. How can I get this logic with the Dataset?
Note that I prefer the Dataset interface over the QueueBase interface because it gives me the iterator interface which I can reinitialize and even reset to a different Dataset. This is more powerful compared to queues which cannot be reopened currently after they are closed (see here and here).
Maybe a similar question, or the same question: How can I wrap a Dataset around a queue? I have some thread which reads some data from somewhere and which can feed it and queue it somehow. How do I get the data into the Dataset? I could repeat some dummy tensor infinitely many times and then use map to just return my queue.dequeue(), but that really only gets me back to all the original problems with the queue, i.e. how to reopen the queue.
The new Dataset.from_generator() method allows you to define a Dataset that is fed by a Python generator. (To use this feature at present, you must download a nightly build of TensorFlow or build it yourself from source. It will be part of TensorFlow 1.4.)
The easiest way to implement your example would be to replace your receiving thread with a generator, with pseudocode as follows:
def receiver():
  while True:
    next_element = ...  # Receive next element from external source.
                        # Note that this method may block.
    end_of_epoch = ...  # Decide whether or not to stop based on next_element.
    if not end_of_epoch:
      yield next_element  # Note: you may need to convert this to an array.
    else:
      return  # Returning will signal OutOfRangeError on downstream iterators.

dataset = tf.contrib.data.Dataset.from_generator(receiver, output_types=...)

# You can chain other `Dataset` methods after the generator. For example:
dataset = dataset.prefetch(...)  # This will start a background thread
                                 # to prefetch elements from `receiver()`.
dataset = dataset.repeat(...)  # Note that each repetition will call
                               # `receiver()` again, and start from
                               # a fresh state.
dataset = dataset.batch(...)
More complicated topologies are possible. For example, you can use Dataset.interleave() to create many receivers in parallel.
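If you would rather keep the existing receiving thread than replace it, one possible arrangement is a generator that drains a queue filled by that thread. The sketch below uses only the Python standard library; END_OF_EPOCH, make_receiver and producer are hypothetical names for this illustration, and the producer simply replays a list instead of reading from a real external source.

```python
import queue
import threading

END_OF_EPOCH = object()  # hypothetical sentinel marking the end of one epoch


def make_receiver(buffer):
    """Return a generator function that drains `buffer` until the sentinel."""
    def receiver():
        while True:
            item = buffer.get()  # blocks until the producer thread delivers
            if item is END_OF_EPOCH:
                return  # ends the generator, i.e. ends the epoch
            yield item
    return receiver


def producer(buffer, items):
    """Stand-in for the thread that receives data from an external source."""
    for item in items:
        buffer.put(item)
    buffer.put(END_OF_EPOCH)


buffer = queue.Queue(maxsize=4)
receiver = make_receiver(buffer)

epochs = []
for epoch_items in ([1, 2, 3], [4, 5]):
    thread = threading.Thread(target=producer, args=(buffer, epoch_items))
    thread.start()
    epochs.append(list(receiver()))  # a fresh generator per epoch
    thread.join()
print(epochs)
```

The receiver function here takes no arguments, so it has the same shape as the receiver() passed to Dataset.from_generator above. Each new call starts a fresh generator, which is what lets the same queue serve several epochs without being closed and reopened.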

When should we use the place_pruned_graph config?

Question: As the title says, I wonder when we should use the place_pruned_graph config in GraphOptions. What is the purpose of this config?
I don't understand the comment about this config:
// Only place the subgraphs that are run, rather than the entire graph.
//
// This is useful for interactive graph building, where one might
// produce graphs that cannot be placed during the debugging
// process. In particular, it allows the client to continue work in
// a session after adding a node to a graph whose placement
// constraints are unsatisfiable.
We know that TensorFlow normally partitions an entire graph into several subgraphs. The following code from CreateGraphs in direct_session.cc normally takes the else branch. As far as I can see, the if branch is never taken, so I don't know when we should trigger it.
if (options_.config.graph_options().place_pruned_graph()) {
  // Because we are placing pruned graphs, we need to create a
  // new SimpleGraphExecutionState for every new unseen graph,
  // and then place it.
  SimpleGraphExecutionStateOptions prune_options;
  prune_options.device_set = &device_set_;
  prune_options.session_options = &options_;
  prune_options.stateful_placements = stateful_placements_;
  TF_RETURN_IF_ERROR(SimpleGraphExecutionState::MakeForPrunedGraph(
      execution_state_->original_graph_def().library(), prune_options,
      execution_state_->original_graph_def(), subgraph_options,
      &temp_exec_state_holder, &client_graph));
  execution_state = temp_exec_state_holder.get();
} else {
  execution_state = execution_state_.get();
  TF_RETURN_IF_ERROR(
      execution_state->BuildGraph(subgraph_options, &client_graph));
}
The short answer? Never. The longer answer requires me to explain why this option exists at all.
So why does TensorFlow include this convoluted configuration option and logic to handle it? It's a historical accident that came about when tensorflow::DirectSession and tensorflow::GrpcSession had different internal implementations:
The tensorflow::GrpcSession used a single SimpleGraphExecutionState for the entire graph in a session. The net effect of this was that the placer—which is responsible for assigning devices to each node in the graph—would run before the graph was pruned.
The tensorflow::DirectSession originally used one SimpleGraphExecutionState for each pruned subgraph, with some special logic for sharing the placements of stateful nodes between invocations. Therefore the placer would run after the graph was pruned, and could make different decisions about where to place stateful nodes.
The benefit of the tensorflow::GrpcSession approach (place_pruned_graph = false) is that it takes into account all of the colocation constraints in the graph when running the placement algorithm, even if they don't occur in the subgraph being executed. For example, if you had an embedding matrix, and wanted to optimize it using the SparseApplyAdagrad op (which only has a CPU implementation), TensorFlow would figure out that the embedding matrix should be placed on CPU.
By contrast, if you specified no device for the embedding matrix and set place_pruned_graph = true, the matrix would (most likely) be placed on GPU when you ran its initializer, because all of the ops in the initialization subgraph would be runnable on GPU. And, since variables cannot move between devices, TensorFlow would not be able to issue the subgraph that ran SparseApplyAdagrad on the matrix. This was a real issue in the earliest version of TensorFlow.
So why support place_pruned_graph = true at all? It turns out that it is useful when using TensorFlow interactively. The place_pruned_graph = false option is unforgiving: once the graph for a session contains a node that cannot be placed, that session is useless. Because the placement algorithm runs on the whole graph, it would fail every time it is invoked, and therefore no steps could run. When you use a tf.InteractiveSession, we assume that you are using a REPL (or Jupyter notebook) and that it's beneficial to allow you to continue after making such a mistake. Therefore in a tf.InteractiveSession we set place_pruned_graph = true so that you can continue to use the session after adding an unplaceable node (as long as you don't try to run that node in a pruned subgraph).
There is probably a better approach than place_pruned_graph = true for interactive use, but we haven't investigated adding one. Suggestions are always welcome on the GitHub issues page.
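The trade-off described above can be illustrated with a toy placer in plain Python. This is only a sketch of the idea: place, prune and run are hypothetical functions, the graph is a dict from node name to its inputs, and nothing here reflects TensorFlow's actual placement algorithm.

```python
class PlacementError(Exception):
    pass


def place(nodes, supported):
    """Assign a device to every node, or fail if any node has no device."""
    placement = {}
    for node in nodes:
        devices = supported.get(node, [])
        if not devices:
            raise PlacementError("cannot place %r" % node)
        placement[node] = devices[0]
    return placement


def prune(graph, fetch):
    """Return the nodes that `fetch` transitively depends on, including itself."""
    seen, stack = set(), [fetch]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen


def run(graph, supported, fetch, place_pruned_graph):
    """Place either the pruned subgraph or the whole graph, then 'run' the fetch."""
    needed = prune(graph, fetch)
    to_place = needed if place_pruned_graph else set(graph)
    place(sorted(to_place), supported)
    return sorted(needed)


# node -> inputs; "bad" is a node that no device supports.
graph = {"x": [], "y": ["x"], "bad": ["x"]}
supported = {"x": ["cpu"], "y": ["cpu"], "bad": []}

pruned_result = run(graph, supported, "y", place_pruned_graph=True)
try:
    run(graph, supported, "y", place_pruned_graph=False)
    whole_graph_failed = False
except PlacementError:
    whole_graph_failed = True
print(pruned_result, whole_graph_failed)
```

Fetching y succeeds when only the pruned subgraph is placed, but fails when the whole graph is placed, even though the unplaceable node is not needed to compute y. That mirrors why the interactive session tolerates a bad node until you actually try to run it.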

In dynamic_rnn() is it valid to include variables in the state?

I'm implementing an RNN cell around BasicLSTMCell where I want to be able to look back on past hidden states (across batch boundaries). I'm using dynamic_rnn() and the basic pattern I use is:
def __call__(self, inputs, old_state, scope=None):
    mem = old_state[2]
    # [do something with mem]
    cell_out, new_state = self.cell(inputs,
                                    (old_state[0],
                                     old_state[1]))
    h_state = new_state.h
    c_state = new_state.c
    # control dependency required because of self.buf_index
    with tf.get_default_graph().control_dependencies([cell_out]):
        new_mem = write_to_buf(self.out_buf,
                               cell_out,
                               self.buf_index)
    # update the buffer index
    with tf.get_default_graph().control_dependencies(new_mem):
        inc_step = tf.assign(self.buf_index, (self.buf_index + 1) %
                             self.buf_size)
    with tf.get_default_graph().control_dependencies([inc_step]):
        h_state = tf.identity(h_state)
    t = [c_state, h_state, new_mem]
    return cell_out, tuple(t)
self.out_buf and self.buf_index are variables. write_to_buf() is a function that uses scatter_update() to write the new hidden states to the buffer and returns the result.
I rely on the assumption that accessing the return value of scatter_update() guarantees that the new variable value is used (similar to this), so that caching of variables does not mess things up.
From debug prints it seems to work but it would be nice to get some confirmation or suggestions on alternatives.

Can cond support TF ops with side effects?

The (source code) documentation for tf.cond is unclear on whether the functions to be performed when the predicate is evaluated can have side effects or not. I've done some tests but I'm getting conflicting results. For example the code below does not work:
import tensorflow as tf
from tensorflow.python.ops import control_flow_ops
pred = tf.placeholder(tf.bool, [])
count = tf.Variable(0)
adder = count.assign_add(1)
subtractor = count.assign_sub(2)
my_op = control_flow_ops.cond(pred, lambda: adder, lambda: subtractor)
sess = tf.InteractiveSession()
tf.initialize_all_variables().run()
my_op.eval(feed_dict={pred: True})
count.eval() # returns -1
my_op.eval(feed_dict={pred: False})
count.eval() # returns -2
I.e. no matter what value the predicate evaluates to, both functions are getting run, and so the net result is a subtraction of 1. On the other hand, this code snippet does work, where the only difference is that I add new ops to the graph every time my_op is called:
pred = tf.placeholder(tf.bool, [])
count = tf.Variable(0)
my_op = control_flow_ops.cond(pred, lambda:count.assign_add(1), lambda:count.assign_sub(2))
sess = tf.InteractiveSession()
tf.initialize_all_variables().run()
my_op.eval(feed_dict={pred: False})
count.eval() # returns -2
my_op.eval(feed_dict={pred: True})
count.eval() # returns -1
Not sure why creating new ops every time works while the other case doesn't, but I'd obviously rather not be adding nodes as the graph will eventually become too big.
Your second version, where the assign_add() and assign_sub() ops are created inside the lambdas passed to cond(), is the correct way to do this. Fortunately, each of the two lambdas is only evaluated once, during the call to cond(), so your graph will not grow without bound.
Essentially what cond() does is the following:
1. Create a Switch node, which forwards its input to only one of two outputs, depending on the value of pred. Let's call the outputs pred_true and pred_false. (They have the same value as pred but that's unimportant since this is never directly evaluated.)
2. Build the subgraph corresponding to the if_true lambda, where all of the nodes have a control dependency on pred_true.
3. Build the subgraph corresponding to the if_false lambda, where all of the nodes have a control dependency on pred_false.
4. Zip together the lists of return values from the two lambdas, and create a Merge node for each of these. A Merge node takes two inputs, of which only one is expected to be produced, and forwards it to its output.
5. Return the tensors that are the outputs of the Merge nodes.
This means you can run your second version, and be content that the graph remains a fixed size, regardless of how many steps you run.
The reason your first version doesn't work is that, when a Tensor is captured (like adder or subtractor in your example), an additional Switch node is added to enforce the logic that the value of the tensor is only forwarded to the branch that actually executes. This is an artifact of how TensorFlow combines feed-forward dataflow and control flow in its execution model. The result is that the captured tensors (in this case the results of the assign_add and assign_sub) will always be evaluated, even if they aren't used, and you'll see their side effects. This is something we need to document better, and as Michael says, we're going to make this more usable in future.
The second case works because you have added the ops within the cond: this causes them to conditionally execute.
The first case is analogous to saying:
adder = (count += 1)
subtractor = (count -= 2)
if (cond) { adder } else { subtractor }
Since adder and subtractor are outside the conditional, they are always executed.
The second case is more like saying:
if (cond) { adder = (count += 1) } else { subtractor = (count -= 2) }
which in this case does what you expected.
We realize that the interaction between side effects and (somewhat) lazy evaluation is confusing, and we have a medium-term goal to make things more uniform. But the important thing to understand for now is that we do not do true lazy evaluation: the conditional acquires a dependency on every quantity defined outside the conditional that is used within either branch.
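The two cases can be mimicked in plain Python to see where the numbers in the question come from. This is a sketch of the observable behaviour only, not of the Switch and Merge machinery; Counter, cond, step_captured and step_inside are hypothetical names, and ordinary eager Python stands in for one session.run step.

```python
class Counter:
    """Stand-in for the `count` variable."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n
        return self.value


def cond(pred, true_fn, false_fn):
    # Only the selected branch function is called.
    return true_fn() if pred else false_fn()


def step_captured(count, pred):
    # Both updates are defined outside the conditional, so both run
    # on every step, whatever the predicate is.
    adder = count.add(1)
    subtractor = count.add(-2)
    return cond(pred, lambda: adder, lambda: subtractor)


def step_inside(count, pred):
    # Each update is created inside its branch, so only one runs.
    return cond(pred, lambda: count.add(1), lambda: count.add(-2))


captured = Counter()
captured_history = []
for pred in (True, False):
    step_captured(captured, pred)
    captured_history.append(captured.value)

inside = Counter()
inside_history = []
for pred in (False, True):
    step_inside(inside, pred)
    inside_history.append(inside.value)

print(captured_history, inside_history)
```

The captured variant gives -1 and then -2, matching the first snippet in the question, because each step applies both +1 and -2. The inside variant gives -2 and then -1, matching the second snippet, because only the chosen branch has a side effect.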