How can I train two TensorFlow scripts on a single GPU in parallel? - tensorflow

I am getting an error when I run two TensorFlow scripts on a single GPU.
I have tried the allow_growth and GPU memory allocation steps, but the first script still executes without problems while the second script fails with a ResourceExhaustedError and a graph/session creation error.
Kindly help.

Run each program separately first for a few iterations and check nvidia-smi dmon to see how much memory each program actually requires. Then set config.gpu_options.per_process_gpu_memory_fraction = ... in your session configuration based on the memory figures you learned from nvidia-smi dmon. If the memory required for both together is greater than what you have available, you will run into this ResourceExhaustedError.
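For example, a minimal sketch of that session configuration, assuming nvidia-smi showed this script needs roughly 40% of the card (tune the fraction per script):
import tensorflow as tf

# Cap this process at ~40% of the GPU's memory so a second
# script can claim the remainder (adjust per script from nvidia-smi).
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4

with tf.Session(config=config) as sess:
    ...  # build and run your graph here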

You should do the following:
import tensorflow as tf

# Don't let a single script grab all of the VRAM up front;
# this way several scripts can share the same GPU.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
    ...  # build and run your graph here
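If you are on TensorFlow 2.x instead (where these calls live under tf.compat.v1), the equivalent is to enable memory growth on the physical devices; a minimal sketch, assuming TF 2.1+:
import tensorflow as tf

# TF 2.x equivalent of allow_growth: let each process grow its
# allocation on demand instead of grabbing all VRAM up front.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)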
If any of you know how to train two graphs in parallel in a single script, please let me know.

Related

Puzzled by OOM Error on GPU when using 15595MiB / 16125MiB

I am using TensorFlow 2.X.
My GPU memory is 16125MiB, but my model requires 15595MiB, according to nvidia-smi.
With this total usage, I get an OOM after some time, even when setting the minimum batch size.
I also tried the following, but as soon as more memory is required, I still run out of memory:
import tensorflow as tf

config = tf.compat.v1.ConfigProto()
# config.gpu_options.per_process_gpu_memory_fraction = 0.8
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)
The total memory value should also include the ~150MiB used by Firefox, which I need once in a while, but I doubt killing it would make much difference.
My dataset is loaded through individual .h5 files (one per data sample), and intermediate computation arrays/tensors are deleted when no longer used.
Unfortunately, I cannot resize the model, for several reasons related to an ongoing publication.
Is there any trick I can use to gain that extra 500MiB and train my model safely, without running into OOM halfway through?

Running a Tensorflow model inference script on multiple GPUs

I'm trying to run model scoring (the inference graph) from the TensorFlow Object Detection API on multiple GPUs. I tried specifying the GPU number in the main function, but it runs on only a single GPU. I placed a GPU utilization snapshot here.
I'm using tensorflow-gpu==1.13.1; can you kindly point out what I'm missing here?
for i in range(2):
    with tf.device('/gpu:{}'.format(i)):
        tf_init()
        init = tf.global_variables_initializer()
        with detection_graph.as_default():
            with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
                # call to run_inference_multiple_images function
                ...
The responses to this question should give you a few options for fixing this.
Usually TensorFlow will occupy all visible GPUs unless told otherwise. So, if you haven't already tried it, you could just remove the with tf.device line (assuming you only have the two GPUs) and TensorFlow should use them both.
Otherwise, I think the easiest fix is setting the environment variable with os.environ["CUDA_VISIBLE_DEVICES"] = "0,1".
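For example, a minimal sketch; the key point is that the variable has to be set before TensorFlow is imported:
import os

# Make both GPUs visible to this process (set before importing TensorFlow).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import tensorflow as tf
print(tf.test.gpu_device_name())  # quick sanity check that a GPU is visible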

How to prevent TPUEstimator from using GPU or TPU

I need to force TPUEstimator to use the CPU. I have a rented Google machine, and the GPU is already busy with training. Since the CPUs are idle, I want to start a second TensorFlow session for evaluation, but I want to force the evaluation cycle to use CPUs only so that it does not steal GPU time.
I am assuming there is a flag in the run_config or similar for doing this, but I am struggling to find one in the TF documentation.
run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    master=FLAGS.master,
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_tpu_cores,
        per_host_input_for_training=is_per_host))
You can run a TPUEstimator locally by including two arguments: (1) use_tpu should be set to False, and (2) tf.contrib.tpu.RunConfig should be passed as the config argument.
my_tpu_estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=my_model_fn,
    config=tf.contrib.tpu.RunConfig(),
    use_tpu=False)
The majority of example TPU models can be run in local mode by setting the command line flags:
$> python mnist_tpu.py --use_tpu=false --master=''
More documentation can be found here.
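If the goal is specifically to keep the evaluation process off the GPU as well (not just off the TPU), one option, assuming evaluation runs in its own process, is to hide the GPUs from that process before TensorFlow is imported; a minimal sketch:
import os

# Hide all GPUs from this process so the evaluation loop runs on the CPU
# and does not compete with the training job for GPU memory.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf  # this process will now only see the CPU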

Tensorboard doesn't show runtime/memory for all operations

I have a network implemented in TensorFlow that takes very long to train, so I want to profile it to see which parts cause the long runtime.
To do that, I follow the instructions here to capture runtime and memory information. My code looks like this:
# define network
loss = ...
train_op = tf.train.AdamOptimizer().minimize(loss, global_step=global_step)

# run forward and backward prop for one batch
run_metadata = tf.RunMetadata()
options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
_, loss, sum = sess.run([train_op, loss, sum], feed_dict=fd, options=options, run_metadata=run_metadata)
writer.add_run_metadata(run_metadata, 'step_%d' % step)
I can then see "session runs" in TensorBoard. However, as soon as I load a session run, most operations in my graph turn orange, as shown below, and no runtime or memory information is available for them:
According to the legend, these operations are "unused". But that cannot be the case, since almost everything except "loss" and "opt" is shown like that. Clearly the whole network has to be used to compute the loss, so I don't see why the graph is displayed this way.
I use TF 1.3 on a Tesla K40c.
I used to have the same problem as you, with TensorBoard not registering anything in my session run except the gradient and optimizer ops.
I fixed it by upgrading my version of TensorFlow to the 1.4 release.
Not sure.
Try adding this line
writer.add_summary(_, step)
after
writer.add_run_metadata...
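Putting that together with the question's code, a minimal sketch of the intended pattern, assuming sum_op is a merged summary op (e.g. tf.summary.merge_all()) and writer is a tf.summary.FileWriter:
run_metadata = tf.RunMetadata()
options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
_, loss_val, summary_val = sess.run([train_op, loss, sum_op],
                                    feed_dict=fd,
                                    options=options,
                                    run_metadata=run_metadata)
# register the trace first, then the summary for the same step
writer.add_run_metadata(run_metadata, 'step_%d' % step)
writer.add_summary(summary_val, step)
writer.flush()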

Tensorflow processing performance with multiple GPUs

Friends!
I have a question about processing with multiple GPUs.
I'm using 4 GPUs and tried a simple A^n + B^n example in 3 ways, as shown below.
1. Single GPU
with tf.device('/gpu:0'):
    ...tf.matpow code...
2. Multiple GPUs
with tf.device('/gpu:0'):
    ...tf.matpow code...
with tf.device('/gpu:1'):
    ...tf.matpow code...
3. No specific GPU designated (I think maybe all of the GPUs are used)
...just the tf.matpow code...
When I tried this, the results were hard to understand. They were:
1. single GPU: 6.x seconds
2. multiple GPUs (2 GPUs): 2.x seconds
3. no specific GPU designated (maybe 4 GPUs): 4.x seconds
I cannot understand why #2 is faster than #3.
Can anyone help me?
Thanks.
While the Tensorflow scheduler works well for single GPUs, it is not yet as good at optimizing the placement of computations on multiple GPUs. (Although this is being worked on presently.) Without further details, it's hard to know exactly what's going on. To get a better picture, you can log where the computations are actually being placed by the scheduler. You can do this by setting the log_device_placement flag when creating the tf.Session:
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
In the third code sample (where no GPU was designated) Tensorflow didn't use all of your GPUs. By default, if Tensorflow can find a GPU ("/gpu:0") to use, it assigns as many calculations as possible to that GPU. You would need to tell it specifically that you want it to use all 4, as you did in the second code sample.
From the Tensorflow documentation:
If you have more than one GPU in your system, the GPU with the lowest ID will be selected by default. If you would like to run on a different GPU, you will need to specify the preference explicitly:
with tf.device('/gpu:2'):
    # tf code here
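Applied to the question's benchmark, a minimal sketch that places one chunk of the A^n work on each of the four GPUs explicitly and combines the partial results on the CPU (matpow here stands in for the question's repeated tf.matmul loop):
import numpy as np
import tensorflow as tf

def matpow(M, n):
    # naive repeated matmul, as in the A^n + B^n example
    result = M
    for _ in range(n - 1):
        result = tf.matmul(result, M)
    return result

A = np.random.rand(2000, 2000).astype(np.float32)
partial = []
# place one copy of the work on each of the four GPUs explicitly
for i in range(4):
    with tf.device('/gpu:%d' % i):
        partial.append(matpow(tf.constant(A), 10))

with tf.device('/cpu:0'):
    total = tf.add_n(partial)  # gather the partial results on the CPU

config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(total)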