Change number of GPUs when deploying - tensorflow

I have a fine-tuned Inception v3 on a 2-GPU machine.
Now I am trying to run the trained model on another machine with 1 GPU, but I got an error like this:
Cannot assign a device to node 'tower_1/gradients/tower_1/conv0/BatchNorm/moments/sufficient_statistics/Sub_grad/BroadcastGradientArgs': Could not satisfy explicit device specification '/device:GPU:1' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0, /job:localhost/replica:0/task:0/gpu:0
It seems that the model wants a 2-GPU environment like the one it was trained on. Can I convert this model so that it uses only 1 GPU?

I changed two things and it worked.
Turn on the allow_soft_placement option in the Session:
config = tf.ConfigProto(allow_soft_placement=True)
sess = tf.Session(config=config)
Rename the model file from model.ckpt-50000.data-00000-of-00001 to model.ckpt-50000.
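Putting the two fixes together, a minimal restore sketch (assuming the renamed checkpoint and its model.ckpt-50000.meta file sit in the working directory):

import tensorflow as tf

# allow_soft_placement lets TensorFlow fall back to an available device
# when the saved graph asks for '/device:GPU:1' and it doesn't exist.
config = tf.ConfigProto(allow_soft_placement=True)

with tf.Session(config=config) as sess:
    saver = tf.train.import_meta_graph("model.ckpt-50000.meta")
    saver.restore(sess, "model.ckpt-50000")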

Related

Running Tensorflow model inference script on multiple GPU

I'm trying to run model scoring (the inference graph) from the TensorFlow Object Detection API on multiple GPUs. I tried specifying the GPU number in main, but it runs only on a single GPU. I placed a GPU utilization snapshot here.
Using tensorflow-gpu==1.13.1, can you kindly point out what I'm missing here?
for i in range(2):
    with tf.device('/gpu:{}'.format(i)):
        tf_init()
        init = tf.global_variables_initializer()

with detection_graph.as_default():
    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
        # call to run_inference_multiple_images function
        ...
The responses to this question should give you a few options for fixing this.
Usually TensorFlow will occupy all visible GPUs unless told otherwise. So if you haven't already tried, you could just remove the with tf.device line (assuming you only have the two GPUs) and TensorFlow should use them both.
Otherwise, I think the easiest option is setting the environment variable with os.environ["CUDA_VISIBLE_DEVICES"] = "0,1".
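For example (a minimal sketch; the variable must be set before TensorFlow initializes its GPU devices, so setting it before the import is the safe approach):

import os

# Make both GPUs visible to this process; this must happen before
# TensorFlow touches CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import tensorflow as tf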

How to prevent TPUEstimator from using GPU or TPU

I need to force TPUEstimator to use the CPU. I have a rented Google machine and the GPU is already running training. Since the CPUs are idle, I want to start a second TensorFlow session for evaluation, but I want to force the evaluation cycle to use CPUs only so that it does not steal GPU time.
I am assuming there is a flag in the run_config or similar for doing this but am struggling to find one in the TF documentation.
run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    master=FLAGS.master,
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_tpu_cores,
        per_host_input_for_training=is_per_host))
You can run a TPUEstimator locally by including two arguments: (1) use_tpu should be set to False, and (2) tf.contrib.tpu.RunConfig should be passed as the config argument.
my_tpu_estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=my_model_fn,
    config=tf.contrib.tpu.RunConfig(),
    use_tpu=False)
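If you additionally want to keep the evaluation off the GPU entirely, one option (a sketch, relying on session_config being forwarded to the underlying tf.estimator.RunConfig) is to hide the GPUs from the session:

import tensorflow as tf

# device_count={"GPU": 0} tells the session to register no GPU devices,
# so the evaluation cycle stays on the CPU.
run_config = tf.contrib.tpu.RunConfig(
    session_config=tf.ConfigProto(device_count={"GPU": 0}))

my_tpu_estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=my_model_fn,
    config=run_config,
    use_tpu=False)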
The majority of example TPU models can be run in local mode by setting the command line flags:
$> python mnist_tpu.py --use_tpu=false --master=''
More documentation can be found here.

MirroredStrategy doesn't use GPUs

I wanted to use tf.contrib.distribute.MirroredStrategy() on my multi-GPU system, but it doesn't use the GPUs for training (see the output below). I am running tensorflow-gpu 1.12.
I did try to specify the GPUs directly in the MirroredStrategy, but the same problem appeared.
model = models.Model(inputs=input, outputs=y_output)
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
model.compile(loss=lossFunc, optimizer=optimizer)

NUM_GPUS = 2
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=NUM_GPUS)
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.keras.estimator.model_to_estimator(model,
                                                  config=config)
These are the results I am getting:
INFO:tensorflow:Device is available but not used by distribute strategy: /device:CPU:0
INFO:tensorflow:Device is available but not used by distribute strategy: /device:GPU:0
INFO:tensorflow:Device is available but not used by distribute strategy: /device:GPU:1
WARNING:tensorflow:Not all devices in DistributionStrategy are visible to TensorFlow session.
The expected result would obviously be for the training to run on both GPUs. Is this a known issue?
I've been facing a similar issue with MirroredStrategy failing on tensorflow 1.13.1 with 2x RTX2080 running an Estimator.
The failure seems to be in the NCCL all_reduce method (error message - no OpKernel registered for NCCL AllReduce).
I got it to run by changing from NCCL to hierarchical_copy, which meant using the contrib cross_device_ops methods as follows:
Failed command:
mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0","/gpu:1"])
Successful command:
mirrored_strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.contrib.distribute.AllReduceCrossDeviceOps(
        all_reduce_alg="hierarchical_copy"))
In newer TensorFlow versions, AllReduceCrossDeviceOps no longer exists. You can use tf.distribute.HierarchicalCopyAllReduce() instead:
mirrored_strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
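For completeness, a TF 2.x-style sketch (the model here is a hypothetical stand-in; variables created inside the scope are mirrored across both GPUs):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

with strategy.scope():
    # Build and compile the model inside the strategy scope so its
    # variables are replicated on both devices.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(loss="mse", optimizer="adam")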

Tensorflow, restore variables on a specific device

Maybe my question is a bit naive, but I really didn't find anything in the tensorflow documentation.
I have a trained tensorflow model whose variables were placed on the GPU. Now I would like to restore this model and test it using the CPU.
If I do this via `tf.train.Saver.restore`, as in the example:
saver = tf.train.import_meta_graph("/tmp/graph.meta")
saver.restore(session, "/tmp/model.ckp")
I get the following exception:
InvalidArgumentError: Cannot assign a device to node 'b_fc8/b_fc8/Adam_1': Could not satisfy explicit device specification '/device:GPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
How can I restore these variables on the CPU?
Thanks
Use the clear_devices flag, i.e.
saver = tf.train.import_meta_graph("/tmp/graph.meta", clear_devices=True)
I'm using tensorflow 0.12, and clear_devices=True and tf.device('/cpu:0') were not working for me (saver.restore was still trying to assign variables to /gpu:0).
I really needed to force everything onto /cpu:0, since I was loading several models which wouldn't fit in GPU memory anyway. Here are two alternatives to force everything to /cpu:0:
Set os.environ['CUDA_VISIBLE_DEVICES']=''
Use the device_count of ConfigProto like tf.Session(config=tf.ConfigProto(device_count={"GPU": 0, "CPU": 1}))
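Putting the pieces together, a minimal sketch of restoring on the CPU only (using the checkpoint paths from the question):

import os
import tensorflow as tf

# Option 1: hide all GPUs from CUDA before TensorFlow starts.
os.environ['CUDA_VISIBLE_DEVICES'] = ''

# Option 2: register zero GPU devices in the session.
config = tf.ConfigProto(device_count={"GPU": 0, "CPU": 1})

with tf.Session(config=config) as session:
    # clear_devices=True strips the saved device placements from the graph.
    saver = tf.train.import_meta_graph("/tmp/graph.meta", clear_devices=True)
    saver.restore(session, "/tmp/model.ckp")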

Moving checkpoints around in TensorFlow

I have the following setup: I train a model on our GPU server, save a checkpoint using the tf.train.Saver() functionality within a tf.train.Supervisor(). After training, I want to transfer this model to my laptop and load it for inference purposes.
When attempting to restore the model with self.saver.restore(sess, self.checkpoint_path) (having re-created the proper graph beforehand), I get the following error:
E tensorflow/core/client/tensor_c_api.cc:485] Cannot assign a device to node 'worker_0/save/Const': Could not satisfy explicit device specification '/job:worker/task:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
Colocation Debug Info:
Colocation group had the following types and devices:
Identity: CPU
Const: CPU
[[Node: worker_0/save/Const = Const[dtype=DT_STRING, value=Tensor<type: string shape: [] values: model>, _device="/job:worker/task:0"]()]]
When analysing the properties of the cpkt object returned by
cpkt = tf.train.get_checkpoint_state(self.checkpoint_dir)
I see that cpkt.model_checkpoint_path points to the original path on the server, where the checkpoint was created, not to self.checkpoint_path, from which I tried to restore the model.
Are these two things connected, or is there another reason for the error message above?
Any help would be appreciated,
Mat
It sounds like your device assignments are saved, and the same devices are not available in your restore environment.
There's a flag clear_devices in freeze_graph and import_meta_graph which you can use to clear that information.
Alternatively, you can edit the pbtxt with your graph information and manually remove all lines starting with device:
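For example, with import_meta_graph (a sketch; model.ckpt is a placeholder for your checkpoint prefix):

import tensorflow as tf

# clear_devices=True drops the saved '/job:worker/task:0' placements,
# so the graph can be placed on whatever devices the laptop has.
saver = tf.train.import_meta_graph("model.ckpt.meta", clear_devices=True)

with tf.Session() as sess:
    saver.restore(sess, "model.ckpt")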