Moving checkpoints around in TensorFlow

I have the following setup: I train a model on our GPU server and save a checkpoint using the tf.train.Saver() functionality within a tf.train.Supervisor(). After training, I want to transfer this model to my laptop and load it for inference.
When attempting to restore the model with self.saver.restore(sess, self.checkpoint_path) (having re-created the proper graph beforehand), I get the following error:
E tensorflow/core/client/tensor_c_api.cc:485] Cannot assign a device to node 'worker_0/save/Const': Could not satisfy explicit device specification '/job:worker/task:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
Colocation Debug Info:
Colocation group had the following types and devices:
Identity: CPU
Const: CPU
[[Node: worker_0/save/Const = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: model>, _device="/job:worker/task:0"]()]]
When analysing the properties of the cpkt object returned by
cpkt = tf.train.get_checkpoint_state(self.checkpoint_dir)
I see that cpkt.model_checkpoint_path points to the original path on the server where the checkpoint was created, not to self.checkpoint_path, from which I tried to restore the model.
Are these two things connected? Or is there another reason for my error message?
Any help would be appreciated,
Mat

It sounds like your device assignments are saved, and the same devices are not available in your restore environment.
There's a flag clear_devices in freeze_graph and import_meta_graph which you can use to clear that information.
Alternatively, you can edit the pbtxt file containing your graph definition and manually remove all lines starting with device:
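For example, a minimal sketch of the clear_devices route (the meta graph and checkpoint prefix below are placeholders for wherever the files were copied on the laptop):
import tensorflow as tf

with tf.Session() as sess:
    # clear_devices=True strips the saved '/job:worker/task:0' assignments
    # so the graph can be placed on the devices available locally.
    saver = tf.train.import_meta_graph('model.meta', clear_devices=True)
    # Restore from the local checkpoint prefix, not the server path stored
    # inside the checkpoint state file.
    saver.restore(sess, 'model')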

Related

Can TensorFlow summary ops be assigned to a GPU?

Here is part of my code:
with tf.Graph().as_default(), tf.device('/cpu:0'):
    global_step = tf.get_variable(
        'global_step',
        [],
        initializer=tf.constant_initializer(0))
    writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph())
    with tf.device('/gpu:0'):
        tf.summary.scalar('learning_rate', INITIAL_LEARNING_RATE)
        summary_op = tf.summary.merge_all()
When I run it, I get the following error:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'learning_rate': Could not satisfy explicit device specification '/device:GPU:0' because no
supported kernel for GPU devices is available.
[[Node: learning_rate = ScalarSummary[T=DT_FLOAT, _device="/device:GPU:0"](learning_rate/tags, learning_rate/values)]]
If I move these 2 ops into the tf.device("/cpu:0") device scope, it works:
tf.summary.scalar('learning_rate', INITIAL_LEARNING_RATE)
summary_op = tf.summary.merge_all()
I googled it; there are many suggestions about using "allow_soft_placement=True", but as far as I understand that option simply changes the device scope automatically. So my questions are:
Why can these 2 ops not be assigned to the GPU? Is there any documentation I can look at to figure out which ops can or cannot be assigned to a GPU?
Any suggestion is welcome.
You can't assign a summary operation to a GPU because it is meaningless.
In short, a GPU executes parallel operations. A summary is nothing but a file to which you append new lines every time you write to it. It's a sequential operation that has nothing in common with the operations GPUs are designed to perform.
Your error says it all:
Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
That operation (in the tensorflow version you're using) has no GPU implementation and thus must be sent to a CPU device.
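As a sketch of the two workarounds described above (the learning-rate value is a placeholder; only the device scoping matters here):
import tensorflow as tf

INITIAL_LEARNING_RATE = 0.1  # placeholder value for this sketch

with tf.Graph().as_default():
    with tf.device('/cpu:0'):
        # ScalarSummary has no GPU kernel, so pin the summary ops to the CPU.
        tf.summary.scalar('learning_rate', INITIAL_LEARNING_RATE)
        summary_op = tf.summary.merge_all()

    # Or, instead of pinning by hand, let TensorFlow fall back automatically
    # when an op has no kernel for the requested device.
    sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
    print(sess.run(summary_op))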

Distributed TensorFlow save fails: no device

I am following the example here to create a distributed TensorFlow model with a parameter server and n workers. I do not have any GPUs; all work is distributed on CPUs.
In the chief worker, I want to save my variables every few steps, but invoking the saver results in the following exception:
Cannot assign a device to node 'save_1/RestoreV2_21':
Could not satisfy explicit device specification
'/job:ps/task:0/device:CPU:0' because no devices matching that
specification are registered in this process; available devices:
/job:localhost/replica:0/task:0/cpu:0
[[Node: save_1/RestoreV2_21 = RestoreV2[dtypes=[DT_INT32],
_device="/job:ps/task:0/device:CPU:0"](save_1/Const,
save_1/RestoreV2_21/tensor_names, save_1/RestoreV2_21/shape_and_slices)]]
I tried:
server = tf.train.Server(cluster,
                         job_name=self.calib.params['job_name'],
                         task_index=self.calib.params['task_index'],
                         config=tf.ConfigProto(allow_soft_placement=True))
I am using a supervisor:
sv = tf.train.Supervisor(
    is_chief=is_chief,
    ...)
and creating my session as follows:
sess = sv.prepare_or_wait_for_session(server.target)
but I still get exactly the same error.
This line in the error message:
available devices: /job:localhost/replica:0/task:0/cpu:0
...suggests that your tf.Session is not connected to the tf.train.Server you created. In particular, it seems to be a local (or "direct") session that can only access devices in the local process.
To fix this problem, when you create your session, pass server.target to the initializer. For example, depending on which API you are using to create the session, you might want to use one of the following:
# Creating a session explicitly.
with tf.Session(server.target) as sess:
    # ...

# Using a `tf.train.Supervisor` called `sv`.
with sv.managed_session(server.target):
    # ...

# Using a `tf.train.MonitoredTrainingSession`.
with tf.train.MonitoredTrainingSession(server.target):
    # ...

Change number of GPUs when deploying

I have a fine-tuned Inception v3 on a 2-GPU machine.
Now I am trying to run the trained model on another machine with 1 GPU, but I got an error like this:
Cannot assign a device to node 'tower_1/gradients/tower_1/conv0/BatchNorm/moments/sufficient_statistics/Sub_grad/BroadcastGradientArgs': Could not satisfy explicit device specification '/device:GPU:1' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0, /job:localhost/replica:0/task:0/gpu:0
It seems that the model wants a 2-GPU environment like the one it was trained on. Can I convert this model so that it uses only 1 GPU?
I changed two things and it worked.
Turn on the allow_soft_placement option in the Session:
config = tf.ConfigProto(allow_soft_placement=True)
sess = tf.Session(config=config)
Use the checkpoint prefix model.ckpt-50000 rather than the full data file name model.ckpt-50000.data-00000-of-00001 when restoring.
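A minimal restore sketch combining both changes (the .meta file name is assumed to follow the usual Saver naming convention; adjust the paths as needed):
import tensorflow as tf

# Soft placement lets ops that were pinned to '/gpu:1' during training fall
# back to the devices that actually exist on the 1-GPU machine.
config = tf.ConfigProto(allow_soft_placement=True)

with tf.Session(config=config) as sess:
    saver = tf.train.import_meta_graph('model.ckpt-50000.meta')
    # Restore from the checkpoint prefix, not the .data-00000-of-00001 file.
    saver.restore(sess, 'model.ckpt-50000')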

TensorFlow, restore variables on a specific device

Maybe my question is a bit naive, but I really didn't find anything in the TensorFlow documentation.
I have a trained TensorFlow model whose variables were placed on the GPU. Now I would like to restore this model and test it using the CPU.
If I do this via `tf.train.Saver.restore` as in the example:
saver = tf.train.import_meta_graph("/tmp/graph.meta")
saver.restore(session, "/tmp/model.ckp")
I get the following exception:
InvalidArgumentError: Cannot assign a device to node 'b_fc8/b_fc8/Adam_1': Could not satisfy explicit device specification '/device:GPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
How can I restore these variables on the CPU?
Thanks
Use the clear_devices flag, i.e.
saver = tf.train.import_meta_graph("/tmp/graph.meta", clear_devices=True)
I'm using TensorFlow 0.12, and clear_devices=True together with tf.device('/cpu:0') was not working for me (saver.restore was still trying to assign variables to /gpu:0).
I really needed to force everything onto /cpu:0, since I was loading several models that wouldn't fit in GPU memory anyway. Here are two alternatives to force everything onto /cpu:0:
Set os.environ['CUDA_VISIBLE_DEVICES'] = ''
Use the device_count field of ConfigProto, e.g. tf.Session(config=tf.ConfigProto(device_count={"GPU": 0, "CPU": 1}))
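A sketch of both alternatives in context, reusing the paths from the question above:
import os
import tensorflow as tf

# Alternative 1: hide all GPUs from TensorFlow before any Session is created,
# so every op can only be placed on the CPU.
os.environ['CUDA_VISIBLE_DEVICES'] = ''

# Alternative 2: tell the Session it may use zero GPU devices.
config = tf.ConfigProto(device_count={'GPU': 0, 'CPU': 1})

with tf.Session(config=config) as sess:
    saver = tf.train.import_meta_graph('/tmp/graph.meta', clear_devices=True)
    saver.restore(sess, '/tmp/model.ckp')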

Device placement unknown in TensorBoard

I'd like to investigate device placement in TensorBoard, using the following code to generate the graph in the summary:
# Build the summary operation based on the TF collection of Summaries.
summary_op = tf.merge_all_summaries()
saver = tf.train.Saver(tf.all_variables())
summary_writer = tf.train.SummaryWriter(log_directory, graph_def=sess.graph_def)
This works for displaying the graph and the summaries defined in the graph. But when selecting 'device placement' in TensorBoard, all nodes are assigned to 'unknown device'. Do I need to dump the device placement in some other way?
The TensorBoard graph visualizer only sees the explicit device assignments that you have made in your program (i.e. those made using with tf.device("..."): blocks).
The reason for this is that the nodes in a TensorFlow graph are assigned to devices in multiple stages. The first stage, in the client (e.g. your Python program) allows you to explicitly—and optionally—assign devices to each node, and it is the output of this stage that is written to the TensorBoard logs. A later placement stage runs inside the TensorFlow backend, and assigns every node to a device.
I suspect you want to analyze the results of the later placement stage. Currently there is no support for this in TensorBoard, but you can extract some information by creating the tf.Session as follows:
sess = tf.Session(config=tf.ConfigProto(
    log_device_placement=True))
…and then the device placement decisions will be logged to stderr.