I read the docs of sess.as_default()
N.B. The default session is a property of the current thread. If you create a new thread, and wish to use the default session in that thread, you must explicitly add a with sess.as_default(): in that thread's function.
My understanding is that if there are two or more sessions when a new thread is created, we must choose which session to use to run TensorFlow code in that thread. So, to do this, a session is chosen and its as_default() is called.
N.B. Entering a with sess.as_default(): block does not affect the current default graph. If you are using multiple graphs, and sess.graph is different from the value of tf.get_default_graph, you must explicitly enter a with sess.graph.as_default(): block to make sess.graph the default graph.
Does this mean that, inside a with sess.as_default(): block, one must also call sess.graph.as_default() in order to run a specific graph?
The tf.Session API mentions that a graph is launched in a session. The following code illustrates this:
import tensorflow as tf
graph1 = tf.Graph()
graph2 = tf.Graph()
with graph1.as_default() as graph:
    a = tf.constant(0, name='a')
    graph1_init_op = tf.global_variables_initializer()

with graph2.as_default() as graph:
    a = tf.constant(1, name='a')
    graph2_init_op = tf.global_variables_initializer()
sess1 = tf.Session(graph=graph1)
sess2 = tf.Session(graph=graph2)
sess1.run(graph1_init_op)
sess2.run(graph2_init_op)
# Both tensor names are a!
print(sess1.run(graph1.get_tensor_by_name('a:0'))) # prints 0
print(sess2.run(graph2.get_tensor_by_name('a:0'))) # prints 1
with sess1.as_default() as sess:
    print(sess.run(sess.graph.get_tensor_by_name('a:0')))  # prints 0

with sess2.as_default() as sess:
    print(sess.run(sess.graph.get_tensor_by_name('a:0')))  # prints 1

with graph2.as_default() as g:
    with sess1.as_default() as sess:
        print(tf.get_default_graph() == graph2)    # prints True
        print(tf.get_default_session() == sess1)   # prints True

        # This is the interesting line
        print(sess.run(sess.graph.get_tensor_by_name('a:0')))  # prints 0
        print(sess.run(g.get_tensor_by_name('a:0')))            # fails

print(tf.get_default_graph() == graph2)    # prints False
print(tf.get_default_session() == sess1)   # prints False
You don't need to call sess.graph.as_default() to run the graph, but you do need to fetch the correct tensors or operations of that graph when you run it. Within the context you can retrieve the current graph or session using tf.get_default_graph or tf.get_default_session.
In the interesting line above, the default session is sess1, and sess.graph is implicitly sess1.graph, i.e. graph1, hence it prints 0.
The line after that fails because it tries to run an operation from graph2 with sess1.
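As a small extra illustration, continuing the snippet above (just a sketch): inside a with sess.as_default() block, Tensor.eval() also uses the default session, and that session always runs against its own graph.

with sess1.as_default():
    a1 = graph1.get_tensor_by_name('a:0')
    print(a1.eval())                          # prints 0; eval() uses the default session (sess1)
    print(tf.get_default_session() is sess1)  # prints True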
How can I add the number 5 after every iteration of the loop?
I want to do something like this:
weight = 0.225
for i in range(10):
    weight += 5
    print(weight)
Here is how I am trying to do it in TensorFlow, but it never updates the weight:
import tensorflow as tf

weights = {
    'h0': tf.Variable(tf.random_normal([1]))
}

def dummy(x):
    weights['h0'] = tf.add(weights['h0'], 5)
    res = tf.add(weights['h0'], x)
    return res

# build computational graph
a = tf.placeholder('float', None)
d = dummy(a)

# initialize variables
init = tf.global_variables_initializer()

# create session and run the graph
with tf.Session() as sess:
    sess.run(init)
    for i in range(10):
        print(sess.run(d, feed_dict={a: [2]}))

# close session
sess.close()
There's an operation created exactly for adding a value and assigning the result back to the input node: tf.assign_add.
You should use it instead of tf.assign + tf.add.
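For example, a minimal sketch of the question's loop rewritten with tf.assign_add (the names follow the question's code; the exact printed values depend on the random initialization):

import tensorflow as tf

weights = {'h0': tf.Variable(tf.random_normal([1]))}
a = tf.placeholder('float', None)

increment_op = tf.assign_add(weights['h0'], [5.0])  # adds 5 to the variable itself
res = tf.add(weights['h0'], a)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(10):
        sess.run(increment_op)                    # the variable is really updated now
        print(sess.run(res, feed_dict={a: [2]}))  # grows by 5 on every iteration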
Also, it's important that you understand why your previous code doesn't work.
weights['h0'] = tf.add(weights['h0'], 5)
res = tf.add(weights['h0'], x)
In the first line, you're defining an add node whose inputs are weights['h0'] and 5, and you're assigning this node to the Python variable weights['h0'].
So now weights['h0'] is a Python variable holding a TensorFlow node.
In the next line, you're defining another add node, between the previous node and x, and you return this node.
When the graph is evaluated, you evaluate the node pointed to by res, which forces the evaluation of the previous node (because res is a function of the node held by weights['h0']).
The problem is that the assignment on the first line is a Python assignment, not a TensorFlow assignment.
That assignment is executed only in the Python environment; it does not define an assign node in the TensorFlow graph.
P.S.: when you use with, you're using a context manager that handles the closing operations for you. You can thus remove sess.close(), because it is executed automatically when you exit that context.
Apparently there is an assign operator
https://www.tensorflow.org/api_docs/python/tf/assign
weights['h0'] = tf.assign(weights['h0'], tf.add(weights['h0'], 5))
I am trying to profile my TensorFlow code (run time and memory consumption of each layer in the network) by following the runtime statistics instructions here. As far as I understand it, I need to create run options and run metadata like this
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
and pass them to sess.run.
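With a plain tf.Session the tutorial's pattern looks roughly like this (train_op, feed_dict, summary_writer and step stand in for whatever my graph defines):

sess.run(train_op,
         feed_dict=feed_dict,
         options=run_options,
         run_metadata=run_metadata)
summary_writer.add_run_metadata(run_metadata, 'step%d' % step)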
However, as I am also trying to use tf.train.MonitoredTrainingSession, I don't know whether I can pass the same things into this class. A plausible approach could make use of hooks, but I do not know how to do that; I am still very new to them.
You can simply create a custom hook and pass it to the MonitoredTrainingSession. There is no need to pass your own tf.RunMetadata() instance to the run call.
Here is an example Hook which stores metadata every N steps to ckptdir:
import tensorflow as tf
class TraceHook(tf.train.SessionRunHook):
    """Hook to perform Traces every N steps."""

    def __init__(self, ckptdir, every_step=50, trace_level=tf.RunOptions.FULL_TRACE):
        self._trace = every_step == 1
        self.writer = tf.summary.FileWriter(ckptdir)
        self.trace_level = trace_level
        self.every_step = every_step

    def begin(self):
        self._global_step_tensor = tf.train.get_global_step()
        if self._global_step_tensor is None:
            raise RuntimeError("Global step should be created to use _TraceHook.")

    def before_run(self, run_context):
        if self._trace:
            options = tf.RunOptions(trace_level=self.trace_level)
        else:
            options = None
        return tf.train.SessionRunArgs(fetches=self._global_step_tensor,
                                       options=options)

    def after_run(self, run_context, run_values):
        global_step = run_values.results - 1
        if self._trace:
            self._trace = False
            self.writer.add_run_metadata(run_values.run_metadata,
                                         f'{global_step}', global_step)
        if not (global_step + 1) % self.every_step:
            self._trace = True
It checks in before_run whether it has to trace or not and if so, adds the RunOptions. In after_run it checks if the next run call needs to be traced and if so, it sets _trace to True again. Additionally it stores the metadata when it is available.
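A minimal usage sketch (assuming a train_op and a global step already exist in the graph, and using a hypothetical checkpoint directory):

hooks = [TraceHook(ckptdir='/tmp/ckptdir', every_step=100)]

with tf.train.MonitoredTrainingSession(checkpoint_dir='/tmp/ckptdir',
                                       hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)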
I have trained a model in TensorFlow and now I would like to visualize which inputs maximally activate an output. I'd like to know what the cleanest way to do this is.
I had thought to do this by creating a trainable input variable which I can assign once per run. Then, by using an appropriate loss function and an optimizer with a var_list containing just this input variable, I would update this input variable until convergence, i.e.:
trainable_input = tf.get_variable(
    'trainable_input',
    shape=data_op.get_shape(),
    dtype=data_op.dtype,
    initializer=tf.zeros_initializer(),
    trainable=True,
    collections=[tf.GraphKeys.LOCAL_VARIABLES])
trainable_input_assign_op = tf.assign(trainable_input, data_op)
data_op = trainable_input

# ... run the rest of the graph building code here, now with a trainable input

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

# loss_op is defined on one of the outputs
train_op = optimizer.minimize(loss_op, var_list=[trainable_input])
However, when I do this I run into issues. If I try to restore the pre-trained graph using a Supervisor, it naturally complains that the new variables created by the AdamOptimizer do not exist in the graph I'm trying to restore. I can remedy this by using get_slots to get the variables the AdamOptimizer creates and manually adding those variables to the tf.GraphKeys.LOCAL_VARIABLES collection, but it feels pretty hacky and I'm not sure what the consequences would be. I can also exclude those variables explicitly from the Saver that is passed to the Supervisor, without adding them to the tf.GraphKeys.LOCAL_VARIABLES collection, but then I get an exception that they do not get properly initialized by the Supervisor:
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 973, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 801, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/coordinator.py", line 386, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python3.5/site-packages/six.py", line 686, in reraise
raise value
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 962, in managed_session
start_standard_services=start_standard_services)
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 719, in prepare_or_wait_for_session
init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/session_manager.py", line 280, in prepare_session
self._local_init_op, msg))
RuntimeError: Init operations did not make model ready. Init op: init, init fn: None, local_init_op: name: "group_deps_5"
op: "NoOp"
input: "^init_1"
input: "^init_all_tables"
, error: Variables not initialized: trainable_input/trainable_input/Adam, trainable_input/trainable_input/Adam_1
I'm not really sure why these variables are not getting initialized since I have used that technique before to exclude some variables from the restore process (GLOBAL and LOCAL) and they seem to get initialized as expected.
In short, my question is whether there is a simple way to add an optimizer to the graph and do a checkpoint restore (where the checkpoint does not contain the optimizer variables) without having to muck around with the internals of the optimizer. If that's not possible, then is there any downside to just adding the optimizer variables to the LOCAL_VARIABLES collection?
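For reference, the get_slots workaround I described looks roughly like this (just a sketch; get_slot_names/get_slot are the public accessors):

# Put the optimizer's slot variables for trainable_input into the
# LOCAL_VARIABLES collection so the restore Saver does not expect them.
for slot_name in optimizer.get_slot_names():
    slot = optimizer.get_slot(trainable_input, slot_name)
    if slot is not None:
        tf.add_to_collection(tf.GraphKeys.LOCAL_VARIABLES, slot)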
The same error occurs when I use the slim library. In fact, slim.learning.train() uses tf.train.Supervisor internally. I hope my answer on this GitHub issue may help with your Supervisor problem.
I had the same problem as you. I solved it with the following two steps.
1. Pass the parameter saver to slim.learning.train():
ckpt = tf.train.get_checkpoint_state(FLAGS.train_dir)
saver = tf.train.Saver(var_list=optimistic_restore_vars(ckpt.model_checkpoint_path) if ckpt else None)
where the function optimistic_restore_vars is defined as
def optimistic_restore_vars(model_checkpoint_path):
    reader = tf.train.NewCheckpointReader(model_checkpoint_path)
    saved_shapes = reader.get_variable_to_shape_map()
    var_names = sorted([(var.name, var.name.split(':')[0])
                        for var in tf.global_variables()
                        if var.name.split(':')[0] in saved_shapes])
    restore_vars = []
    name2var = dict(zip(map(lambda x: x.name.split(':')[0], tf.global_variables()),
                        tf.global_variables()))
    with tf.variable_scope('', reuse=True):
        for var_name, saved_var_name in var_names:
            curr_var = name2var[saved_var_name]
            var_shape = curr_var.get_shape().as_list()
            if var_shape == saved_shapes[saved_var_name]:
                restore_vars.append(curr_var)
    return restore_vars
2. Pass the parameter local_init_op to slim.learning.train() to initialize the newly added variables:
local_init_op = tf.global_variables_initializer()
Finally, the code should look like this:
ckpt = tf.train.get_checkpoint_state(FLAGS.train_dir)
saver = tf.train.Saver(var_list=optimistic_restore_vars(ckpt.model_checkpoint_path) if ckpt else None)
local_init_op = tf.global_variables_initializer()

###########################
# Kicks off the training. #
###########################
slim.learning.train(
    train_tensor,
    saver=saver,
    local_init_op=local_init_op,
    logdir=FLAGS.train_dir,
    master=FLAGS.master,
    is_chief=(FLAGS.task == 0),
    init_fn=_get_init_fn(),
    summary_op=summary_op,
    number_of_steps=FLAGS.max_number_of_steps,
    log_every_n_steps=FLAGS.log_every_n_steps,
    save_summaries_secs=FLAGS.save_summaries_secs,
    save_interval_secs=FLAGS.save_interval_secs,
    sync_optimizer=optimizer if FLAGS.sync_replicas else None
)
How does one tell a tf.train.MonitoredTrainingSession to restore only a subset of the variables, and perform initialization on the rest?
Starting with the cifar10 tutorial at
https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_train.py
I created lists of the variables to restore and initialize, and specified them using a Scaffold that I pass to the MonitoredTrainingSession:
restoration_saver = Saver(var_list=restore_vars)
restoration_scaffold = Scaffold(init_op=variables_initializer(init_vars),
                                ready_op=constant([]),
                                saver=restoration_saver)
but this gives the following error:
RuntimeError: Init operations did not make model ready for local_init. Init op: group_deps, init fn: None, error: Variables not initialized: conv2a/T, conv2b/T, [...]
where the uninitialized variables listed in the error message are the variables in my init_vars list.
The exception is raised by SessionManager.prepare_session(). The source code for that method seems to indicate that if the session is restored from a checkpoint, then the init_op is not run. So it looks like you can either have restored variables or initialized variables, but not both.
OK, so as I suspected, I got what I wanted by implementing a new RefinementSessionManager class based on the existing tf.train.SessionManager. The two classes are almost identical, except that I modified the prepare_session method to call the init_op regardless of whether the model was loaded from a checkpoint.
This allows me to load a list of variables from the checkpoint and initialize the remaining variables in the init_op.
My prepare_session method is this:
def prepare_session(self, master, init_op=None, saver=None,
                    checkpoint_dir=None, wait_for_checkpoint=False,
                    max_wait_secs=7200, config=None, init_feed_dict=None,
                    init_fn=None):
    sess, is_loaded_from_checkpoint = self._restore_checkpoint(
        master,
        saver,
        checkpoint_dir=checkpoint_dir,
        wait_for_checkpoint=wait_for_checkpoint,
        max_wait_secs=max_wait_secs,
        config=config)

    # [removed] if not is_loaded_from_checkpoint:
    # we still want to run any supplied initialization on models that
    # were loaded from checkpoint.

    if not is_loaded_from_checkpoint and init_op is None and not init_fn and self._local_init_op is None:
        raise RuntimeError("Model is not initialized and no init_op or "
                           "init_fn or local_init_op was given")

    if init_op is not None:
        sess.run(init_op, feed_dict=init_feed_dict)
    if init_fn:
        init_fn(sess)

    # [...]
Hope this helps somebody else.
The hint from @avital works; to be more complete: pass a Scaffold object into MonitoredTrainingSession with a local_init_op and a ready_for_local_init_op, like so:
model_ready_for_local_init_op = tf.report_uninitialized_variables(
    var_list=var_list)
model_init_tmp_vars = tf.variables_initializer(var_list)

scaffold = tf.train.Scaffold(saver=model_saver,
                             local_init_op=model_init_tmp_vars,
                             ready_for_local_init_op=model_ready_for_local_init_op)

with tf.train.MonitoredTrainingSession(...,
                                       scaffold=scaffold,
                                       ...) as mon_sess:
    ...
You can solve this with the local_init_op argument, which does get run after loading from a checkpoint.
Scaffold's arguments include the following:
init_op
ready_op
local_init_op
ready_for_local_init_op
init_op will only be called when we do NOT restore from a checkpoint.
if not is_loaded_from_checkpoint:
    if init_op is None and not init_fn and self._local_init_op is None:
        raise RuntimeError("Model is not initialized and no init_op or "
                           "init_fn or local_init_op was given")
    if init_op is not None:
        sess.run(init_op, feed_dict=init_feed_dict)
    if init_fn:
        init_fn(sess)
So init_op actually cannot help here. If you can write a new SessionManager, you can follow @user550701's approach above. We can also use local_init_op, but it may be a little tricky in distributed situations.
Scaffold will generate a default init_op and local_init_op for us (details here):
init_op: initializes tf.global_variables
local_init_op: initializes tf.local_variables
We should initialize our own variables without breaking this default mechanism.
One worker situation
You can create local_init_op like this:
target_collection = [] # Put your target tensors here
collection = tf.local_variables() + target_collection
local_init_op = tf.variables_initializer(collection)
ready_for_local_init_op = tf.report_uninitialized_variables(collection)
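Wiring these ops into a Scaffold could then look roughly like this (restore_saver and checkpoint_dir are assumptions, not part of the snippet above):

scaffold = tf.train.Scaffold(saver=restore_saver,
                             local_init_op=local_init_op,
                             ready_for_local_init_op=ready_for_local_init_op)

with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       scaffold=scaffold) as sess:
    pass  # training loop goes here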
Distributed situation
We should take care of duplicate initialization of our target_collection, because local_init_op will be called multiple times on multiple workers. If the variables are local, it makes no difference. If they are global variables, we should make sure they are only initialized once. To solve the duplication problem, we can manipulate the collection variable: on the chief worker it includes both the local variables and our target_collection, while on non-chief workers we only put local variables into it, as shown below.
if is_chief:
    collection = tf.local_variables() + target_collection
else:
    collection = tf.local_variables()
All in all, it is a little tricky, but we do not have to hack into tensorflow.
I encountered the same problem, and my solution is:
checkpoint_restore_dir_for_monitered_session = None
scaffold = None
if params.restore:
    checkpoint_restore_dir_for_monitered_session = checkpoint_save_dir
    restore_exclude_name_list = params.restore_exclude_name_list
    if len(restore_exclude_name_list) != 0:
        variables_to_restore, variables_dont_restore = get_restore_var_list(restore_exclude_name_list)
        saver_for_restore = tf.train.Saver(var_list=variables_to_restore, name='saver_for_restore')
        ready_for_local_init_op = tf.report_uninitialized_variables(variables_to_restore.values())
        local_init_op = tf.group([
            tf.initializers.local_variables(),
            tf.initializers.variables(variables_dont_restore)
        ])
        scaffold = tf.train.Scaffold(saver=saver_for_restore,
                                     ready_for_local_init_op=ready_for_local_init_op,
                                     local_init_op=local_init_op)

with tf.train.MonitoredTrainingSession(
        checkpoint_dir=checkpoint_restore_dir_for_monitered_session,
        save_checkpoint_secs=None,  # don't save ckpt
        hooks=train_hooks,
        config=config,
        scaffold=scaffold,
        summary_dir=params.log_dir) as sess:
    pass
In this code fragment, get_restore_var_list returns variables_to_restore and variables_dont_restore.
saver_for_restore restores only the variables in variables_to_restore, which are then checked by ready_for_local_init_op.
After that, local_init_op runs, which initializes local_variables() and variables_dont_restore (using whatever initializer they were defined with, e.g. tf.variance_scaling_initializer...).
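get_restore_var_list itself is not shown in the answer; a possible sketch of it, splitting tf.global_variables() by the exclude list, might be:

def get_restore_var_list(restore_exclude_name_list):
    # Hypothetical helper: variables whose name matches an exclude pattern are
    # re-initialized, everything else is restored from the checkpoint.
    variables_to_restore = {}
    variables_dont_restore = []
    for var in tf.global_variables():
        name = var.op.name
        if any(pattern in name for pattern in restore_exclude_name_list):
            variables_dont_restore.append(var)
        else:
            variables_to_restore[name] = var
    return variables_to_restore, variables_dont_restore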
In order to train a model, I have encapsulated it in a class.
I use a tf.RandomShuffleQueue to enqueue a list of filenames.
However, when I dequeue elements they get dequeued, but the size of the queue does not decrease.
Here are my more specific questions, followed by the code snippet:
1. If I have only 5 images, for example, but steps range up to 100, would this result in addfilenames being called repeatedly and automatically? It does not give me any error on dequeuing, so I am thinking that it is getting called automatically.
2. Why is the size of the tf.RandomShuffleQueue not changing? It remains constant.
import os
import time
import functools
import tensorflow as tf
from Read_labelclsloc import readlabel

def ReadTrain(traindir):
    # Returns a list of training images, their labels and a dictionary.
    # The dictionary maps label names to integer numbers.
    return trainimgs, trainlbls, classdict

def ReadVal(valdir, classdict):
    # Reads the validation image labels.
    # Returns a dictionary with filenames as keys and
    # corresponding labels as values.
    return valdict

def lazy_property(function):
    # Just a decorator to make sure that on repeated calls to
    # member functions, ops don't get created repeatedly.
    # Acknowledgements: https://danijar.com/structuring-your-tensorflow-models/
    attribute = '_cache_' + function.__name__

    @property
    @functools.wraps(function)
    def decorator(self):
        if not hasattr(self, attribute):
            setattr(self, attribute, function(self))
        return getattr(self, attribute)

    return decorator

class ModelInitial:

    def __init__(self, traindir, valdir):
        self.graph
        self.traindir = traindir
        self.valdir = valdir
        self.traininginfo()
        self.epoch = 0

    def traininginfo(self):
        self.trainimgs, self.trainlbls, self.classdict = ReadTrain(self.traindir)
        self.valdict = ReadVal(self.valdir, self.classdict)
        with self.graph.as_default():
            self.trainimgs_tensor = tf.constant(self.trainimgs)
            self.trainlbls_tensor = tf.constant(self.trainlbls, dtype=tf.uint16)
            self.trainimgs_dict = {}
            self.trainimgs_dict["ImageFile"] = self.trainimgs_tensor
        return None

    @lazy_property
    def graph(self):
        g = tf.Graph()
        with g.as_default():
            # Layer definitions go here
            return g

    @lazy_property
    def addfilenames(self):
        # This is the function where filenames are pushed to a RandomShuffleQueue
        filename_queue = tf.RandomShuffleQueue(capacity=len(self.trainimgs), min_after_dequeue=0,
                                               dtypes=[tf.string], names=["ImageFile"],
                                               seed=0, name="filename_queue")
        sz_op = filename_queue.size()
        dq_op = filename_queue.dequeue()
        enq_op = filename_queue.enqueue_many(self.trainimgs_dict)
        return filename_queue, enq_op, sz_op, dq_op

    def Train(self):
        # The function for training.
        # I have not written the training part yet.
        # Still struggling with preprocessing
        with self.graph.as_default():
            filename_q, filename_enqueue_op, sz_op, dq_op = self.addfilenames
            qr = tf.train.QueueRunner(filename_q, [filename_enqueue_op])
            filename_dequeue_op = filename_q.dequeue()
            init_op = tf.global_variables_initializer()
            sess = tf.Session(graph=self.graph)
            sess.run(init_op)
            coord = tf.train.Coordinator()
            enq_threads = qr.create_threads(sess, coord=coord, start=True)
            counter = 0
            for step in range(100):
                print(sess.run(dq_op["ImageFile"]))
                print("Epoch = %d " % (self.epoch))
                print("size = %d" % (sess.run(sz_op)))
                counter += 1
            names = [n.name for n in self.graph.as_graph_def().node]
            coord.request_stop()
            coord.join(enq_threads)
            print("Counter = %d" % (counter))
        return None

if __name__ == "__main__":
    modeltrain = ModelInitial(<Path to training images>,
                              <Path to validation images>)
    a = modeltrain.graph
    print(a)
    modeltrain.Train()
    print("Success")
print("Success")
The mystery is caused by the tf.train.QueueRunner that you created for the queue, which causes it to be filled in the background.
The following lines cause a background "queue runner" thread to be created:
qr = tf.train.QueueRunner(filename_q, [filename_enqueue_op])
# ...
enq_threads = qr.create_threads(sess, coord=coord, start=True)
This thread calls filename_enqueue_op in a loop, which causes the queue to be filled up as you remove elements from it.
The background queue-runner thread will almost always have a pending enqueue operation (filename_enqueue_op) on the queue. This means that after you dequeue a filename, the pending enqueue runs and fills the queue back up to capacity. (Technically there is a race condition here, and you could see a size of capacity - 1, but this is quite unlikely.)
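A minimal, self-contained sketch of this behaviour (hypothetical filenames; standard TF1 queue APIs):

import tensorflow as tf

filenames = ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"]
q = tf.RandomShuffleQueue(capacity=len(filenames), min_after_dequeue=0,
                          dtypes=[tf.string])
enq_op = q.enqueue_many([filenames])
dq_op = q.dequeue()
sz_op = q.size()

qr = tf.train.QueueRunner(q, [enq_op])

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = qr.create_threads(sess, coord=coord, start=True)
    for _ in range(3):
        # The queue runner refills the queue in the background, so the size
        # reported here stays at (or very near) capacity.
        print(sess.run(dq_op), sess.run(sz_op))
    coord.request_stop()
    coord.join(threads)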