Distributed TensorFlow save fails with "cannot assign a device" - tensorflow

I am following the example here to create a distributed TensorFlow model with a parameter server and n workers. I do not have any GPUs; all work is distributed on CPUs.
In the chief worker, I want to save my variables every few steps, but invoking the saver results in the following exception:
Cannot assign a device to node 'save_1/RestoreV2_21':
Could not satisfy explicit device specification
'/job:ps/task:0/device:CPU:0' because no devices matching that
specification are registered in this process; available devices:
/job:localhost/replica:0/task:0/cpu:0
[[Node: save_1/RestoreV2_21 = RestoreV2[dtypes=[DT_INT32],
_device="/job:ps/task:0/device:CPU:0"](save_1/Const,
save_1/RestoreV2_21/tensor_names, save_1/RestoreV2_21/shape_and_slices)]]
I tried:
server = tf.train.Server(cluster,
                         job_name=self.calib.params['job_name'],
                         task_index=self.calib.params['task_index'],
                         config=tf.ConfigProto(allow_soft_placement=True))
I am using a supervisor:
sv = tf.train.Supervisor(
is_chief=is_chief,
...)
and creating my session as follows:
sess = sv.prepare_or_wait_for_session(server.target)
but I am still getting the exact same error.

This line in the error message:
available devices: /job:localhost/replica:0/task:0/cpu:0
...suggests that your tf.Session is not connected to the tf.train.Server you created. In particular, it seems to be a local (or "direct") session that can only access devices in the local process.
To fix this problem, when you create your session, pass server.target to the initializer. For example, depending on which API you are using to create the session, you might want to use one of the following:
# Creating a session explicitly.
with tf.Session(server.target) as sess:
    # ...

# Using a `tf.train.Supervisor` called `sv`.
with sv.managed_session(server.target):
    # ...

# Using a `tf.train.MonitoredTrainingSession`.
with tf.train.MonitoredTrainingSession(server.target):
    # ...
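For the original question (periodically saving from the chief worker), a minimal sketch using the Supervisor might look as follows. The cluster spec, toy model, log directory, and step counts are placeholder assumptions for illustration, not details taken from the question:
import tensorflow as tf

# Hypothetical cluster spec; replace with your own hosts and ports.
cluster = tf.train.ClusterSpec({'ps': ['ps0:2222'],
                                'worker': ['worker0:2222', 'worker1:2222']})
server = tf.train.Server(cluster, job_name='worker', task_index=0)
is_chief = True  # e.g. task_index == 0

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    global_step = tf.Variable(0, name='global_step', trainable=False)
    w = tf.Variable(0.0, name='w')  # toy model
    loss = tf.square(w - 1.0)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=global_step)

saver = tf.train.Saver()
sv = tf.train.Supervisor(is_chief=is_chief,
                         logdir='/tmp/train_logs',
                         saver=saver,
                         global_step=global_step,
                         save_model_secs=0)  # we save manually below

# Passing server.target connects the session to the cluster, so the
# /job:ps devices referenced by the saver's restore ops are available.
with sv.managed_session(server.target) as sess:
    step = 0
    while not sv.should_stop() and step < 1000:
        _, step = sess.run([train_op, global_step])
        if is_chief and step % 100 == 0:
            saver.save(sess, '/tmp/train_logs/model.ckpt', global_step=step)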

Related

TensorFlow - Difference between Session() and Session(Graph())

To run a program with TensorFlow, we must declare a session.
So what is the difference between sess = Session() and sess = Session(Graph())?
What is this Graph()?
When designing a model in TensorFlow, there are basically two steps:
building the computational graph: the nodes and operations and how they are connected to each other
evaluating / running this graph on some data
A Session object encapsulates the environment in which Operation objects are executed, and Tensor objects are evaluated. For example:
# `c` can be any tensor in the graph, e.g. a constant or the result of an op.
# Launch the graph in a session.
sess = tf.Session()
# Evaluate the tensor `c`.
print(sess.run(c))
When you create a Session, you launch a graph on the available devices; if no graph is specified, the Session constructor uses the default graph.
sess = tf.Session()
Otherwise, when initializing tf.Session(), you can pass in a graph explicitly, e.g. tf.Session(graph=my_graph):
with tf.Session(graph=my_graph) as sess:
https://www.tensorflow.org/api_docs/python/tf/Session
https://www.tensorflow.org/api_docs/python/tf/Graph
https://github.com/Kulbear/tensorflow-for-deep-learning-research/issues/1
Designing a model in TensorFlow involves these two parts:
Building graph(s), representing the data flow of the computations.
Running a session(s), executing the operations in the graph.
In the general case, there can be multiple graphs and multiple sessions, but there is always one default graph and one default session.
In that context, sess = Session() would use the default graph:
If no graph argument is specified when constructing the session, the default graph will be launched in the session.
sess = Session(Graph()) would mean you are explicitly launching a particular graph, typically because you are using more than one graph.
If you are using more than one graph (created with tf.Graph()) in the same process, you will have to use a different session for each graph, but each graph can be used in multiple sessions. In this case, it is often clearer to pass the graph to be launched explicitly to the session constructor.
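As a concrete illustration of the multi-graph case, here is a minimal sketch (not taken from the answer above): two graphs, each launched in its own session.
import tensorflow as tf

# Two separate graphs, each holding its own constant.
g1 = tf.Graph()
with g1.as_default():
    a = tf.constant(1.0, name='a')

g2 = tf.Graph()
with g2.as_default():
    b = tf.constant(2.0, name='b')

# Each graph gets its own session; passing the graph explicitly makes it
# clear which graph a session launches.
with tf.Session(graph=g1) as sess1:
    print(sess1.run(a))  # 1.0

with tf.Session(graph=g2) as sess2:
    print(sess2.run(b))  # 2.0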

Can TensorFlow summary ops be assigned to a GPU?

Here is part of my code.
with tf.Graph().as_default(), tf.device('/cpu:0'):
    global_step = tf.get_variable(
        'global_step',
        [],
        initializer=tf.constant_initializer(0))
    writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph())
    with tf.device('/gpu:0'):
        tf.summary.scalar('learning_rate', INITIAL_LEARNING_RATE)
        summary_op = tf.summary.merge_all()
When I run it, I get the following error:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'learning_rate': Could not satisfy explicit device specification '/device:GPU:0' because no
supported kernel for GPU devices is available.
[[Node: learning_rate = ScalarSummary[T=DT_FLOAT, _device="/device:GPU:0"](learning_rate/tags, learning_rate/values)]]
If I move these two ops into the tf.device("/cpu:0") device scope, it works:
tf.summary.scalar('learning_rate', INITIAL_LEARNING_RATE)
summary_op = tf.summary.merge_all()
I googled it; there are many suggestions about using "allow_soft_placement=True", but as I understand it that option just changes the device scope automatically. So my questions are:
why can these two ops not be assigned to the GPU? Are there any documents I can look at to figure out which ops can or cannot be assigned to a GPU?
Any suggestion is welcome.
You can't assign a summary operation to a GPU because it is meaningless.
In short, a GPU executes massively parallel operations. A summary is nothing but a file to which you append a new line every time you write to it; it's a sequential operation that has nothing in common with the kind of work GPUs are designed for.
Your error says it all:
Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
That operation (in the TensorFlow version you're using) has no GPU implementation and thus must be placed on a CPU device.
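If you would rather not move the ops into a CPU device scope yourself, allow_soft_placement does what the question alludes to: it lets TensorFlow fall back to the CPU for ops, such as ScalarSummary, that have no GPU kernel. A minimal sketch (the learning-rate constant is a placeholder, not taken from the question):
import tensorflow as tf

INITIAL_LEARNING_RATE = 0.1  # placeholder value

with tf.Graph().as_default():
    with tf.device('/gpu:0'):
        # ScalarSummary has no GPU kernel, so this placement cannot be honored.
        tf.summary.scalar('learning_rate', INITIAL_LEARNING_RATE)
    summary_op = tf.summary.merge_all()

    # allow_soft_placement lets TensorFlow silently place such ops on the CPU;
    # log_device_placement shows where each op actually ended up.
    config = tf.ConfigProto(allow_soft_placement=True,
                            log_device_placement=True)
    with tf.Session(config=config) as sess:
        print(sess.run(summary_op))  # runs; the summary op lands on /cpu:0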

Change number of GPUs when deploying

I have a fine-tuned Inception v3 on a 2-GPU machine.
Now I am trying to run the trained model on another machine with 1 GPU, but I get an error like this:
Cannot assign a device to node 'tower_1/gradients/tower_1/conv0/BatchNorm/moments/sufficient_statistics/Sub_grad/BroadcastGradientArgs': Could not satisfy explicit device specification '/device:GPU:1' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0, /job:localhost/replica:0/task:0/gpu:0
It seems that the model wants a 2-GPU environment like the one it was trained on. Can I convert this model so that it uses only 1 GPU?
I changed two things and it worked.
Turn on the allow_soft_placement option in the session config:
config = tf.ConfigProto(allow_soft_placement=True)
sess = tf.Session(config=config)
Rename the model file from model.ckpt-50000.data-00000-of-00001 to model.ckpt-50000.
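Putting both changes together, a restore might look like the sketch below. The paths are illustrative; the key points are the soft-placement config and that tf.train.Saver.restore expects the checkpoint prefix (model.ckpt-50000), not the .data-00000-of-00001 shard file:
import tensorflow as tf

config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    # Rebuild the graph from the exported meta graph, then restore the
    # weights. Soft placement lets ops pinned to /device:GPU:1 run on
    # whatever devices this machine actually has.
    saver = tf.train.import_meta_graph('/path/to/model.ckpt-50000.meta')
    # Restore from the checkpoint prefix, not the .data-* shard file.
    saver.restore(sess, '/path/to/model.ckpt-50000')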

Moving checkpoints around in TensorFlow

I have the following setup: I train a model on our GPU server, save a checkpoint using the tf.train.Saver() functionality within a tf.train.Supervisor(). After training, I want to transfer this model to my laptop and load it for inference purposes.
When attempting to restore the model with self.saver.restore(sess, self.checkpoint_path) (having re-created the proper graph beforehand), I get the following error:
E tensorflow/core/client/tensor_c_api.cc:485] Cannot assign a device to node 'worker_0/save/Const': Could not satisfy explicit device specification '/job:worker/task:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
Colocation Debug Info:
Colocation group had the following types and devices:
Identity: CPU
Const: CPU
[[Node: worker_0/save/Const = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: model>, _device="/job:worker/task:0"]()]]
When analysing the properties of the cpkt object returned by
cpkt = tf.train.get_checkpoint_state(self.checkpoint_dir)
I see that cpkt.model_checkpoint_path points to the original path on the server, where the checkpoint was created, not to self.checkpoint_path, from which I tried to restore the model.
Are these two things connected? Or is there another reason for the above error message?
Any help would be appreciated,
Mat
It sounds like your device assignments are saved, and the same devices are not available in your restore environment.
There's a flag clear_devices in freeze_graph and import_meta_graph which you can use to clear that information.
Alternatively, you can edit the .pbtxt file containing your graph definition and manually remove all lines starting with device:
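For example, when re-creating the graph from the exported .meta file on the laptop, a sketch like the following (paths are illustrative) uses clear_devices to drop the stale /job:worker/task:0 assignments:
import tensorflow as tf

with tf.Session() as sess:
    # clear_devices=True strips the device assignments that were baked into
    # the graph on the GPU server, so placement is redone locally.
    saver = tf.train.import_meta_graph('/path/on/laptop/model.ckpt.meta',
                                       clear_devices=True)
    saver.restore(sess, '/path/on/laptop/model.ckpt')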

Device placement unknown in TensorBoard

I'd like to investigate device placement in TensorBoard, using the following code to generate the graph in the summary:
# Build the summary operation based on the TF collection of Summaries.
summary_op = tf.merge_all_summaries()
saver = tf.train.Saver(tf.all_variables())
summary_writer = tf.train.SummaryWriter(log_directory, graph_def=sess.graph_def)
This works for displaying the graph and the summaries defined in the graph. But when selecting 'device placement' in TensorBoard, all nodes are assigned to 'unknown device'. Do I need to dump the device placement in some other way?
The TensorBoard graph visualizer only sees the explicit device assignments that you have made in your program (i.e. those made using with tf.device("..."): blocks).
The reason for this is that the nodes in a TensorFlow graph are assigned to devices in multiple stages. The first stage, in the client (e.g. your Python program) allows you to explicitly—and optionally—assign devices to each node, and it is the output of this stage that is written to the TensorBoard logs. A later placement stage runs inside the TensorFlow backend, and assigns every node to a device.
I suspect you want to analyze the results of the later placement stage. Currently there is no support for this in TensorBoard, but you can extract some information by creating the tf.Session as follows:
sess = tf.Session(config=tf.ConfigProto(
log_device_placement=True))
…and then the device placement decisions will be logged to stderr.