Allocating Large Tensor on multiple GPUs using Distributed Learning in Keras - tensorflow

I am using Tensorflow Distributed learning using the following commands -
import os
import tensorflow as tf
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = Basic_Model()
    model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])
The system has four 32 GB GPUs. The following is the output of nvidia-smi -
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 37C P0 65W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 38C P0 40W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 33C P0 40W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 39C P0 41W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
But after running the script to create the model, I am getting the following error -
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape [131072,65536] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:RandomUniform]
A tensor of shape [131072,65536] of type float needs 131072 * 65536 * 4 bytes, i.e. about 34.36 GB. There are four 32 GB GPUs, so why can it not be allocated?

MirroredStrategy creates a copy of every variable created within its scope on each GPU. Since this single tensor needs about 34.36 GB, it does not fit on one 32 GB card, so the allocation fails. You may be looking for something like tf.distribute.experimental.CentralStorageStrategy, which does not mirror the variables. With MirroredStrategy, the usable GPU memory is not vram * num_of_gpu; in practice it is smallest_vram. In your case, Keras is working with 32 GB of memory per replica, not 32 * 4 = 128 GB.
strategy = tf.distribute.experimental.CentralStorageStrategy()
dataset = ...  # placeholder: replace with your own tf.data.Dataset
dataset = strategy.experimental_distribute_dataset(dataset)
with strategy.scope():
    model = Basic_Model()
    model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])
Example:
Tensor A is [0, 1, 2, 3] and you have four GPUs. MirroredStrategy will load:
GPU0: [0, 1, 2, 3]
GPU1: [0, 1, 2, 3]
GPU2: [0, 1, 2, 3]
GPU3: [0, 1, 2, 3]
NOT
GPU0: [0]
GPU1: [1]
GPU2: [2]
GPU3: [3]
As you can see, MirroredStrategy requires every available device to hold a full copy, so you are limited by your smallest device when using this strategy.
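The arithmetic behind this can be checked with a few lines of plain Python. This is only a sketch: the helper name tensor_bytes is made up for illustration, and the per-GPU figure is the 32480MiB total reported by nvidia-smi in the question (the memory TensorFlow can actually use is somewhat lower).

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor (float32 is 4 bytes per element)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element

MIB = 1024 ** 2
needed = tensor_bytes([131072, 65536])   # the tensor from the error message
per_gpu = 32480 * MIB                    # total memory per GPU from nvidia-smi
num_gpus = 4

# MirroredStrategy: every replica must hold the whole tensor.
fits_mirrored = needed <= per_gpu
# Hypothetical pooled view (NOT how MirroredStrategy works).
fits_if_pooled = needed <= per_gpu * num_gpus

print("needed bytes:", needed)
print("per-GPU bytes:", per_gpu)
print("fits on one GPU:", fits_mirrored)
print("would fit if memory were pooled:", fits_if_pooled)
```

The tensor alone is larger than the total memory of a single card, so it cannot be placed on any one replica regardless of how many GPUs are present.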

Related

Using GPU error when use TensorFlow to train image

When I run a TensorFlow image training job in the container tensorflow/tensorflow:latest-gpu, it doesn't work.
Error message:
Cannot assign a device for operation InceptionV3/InceptionV3/Conv2d_1a_3x3/Conv2D: Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0 ]. Make sure the device specification refers to a valid device.
[[node InceptionV3/InceptionV3/Conv2d_1a_3x3/Conv2D (defined at /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/layers.py:1057) = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/device:GPU:0"](fifo_queue_Dequeue, InceptionV3/Conv2d_1a_3x3/weights/read)]]
GPU info:
nvidia-smi
Mon Nov 26 07:48:59 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 630 Off | 00000000:01:00.0 N/A | N/A |
| 25% 47C P0 N/A / N/A | 0MiB / 1998MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
It seems that your TensorFlow is not detecting any GPU as available, yet the operations are explicitly assigned to GPU:0. First try this:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
This prints the available devices. Is /device:GPU:0 among them?
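To make that check explicit, here is a small sketch in plain Python that looks for a given device in a list of device names. The helper find_device is hypothetical, and the sample list is copied from the error message above; with TensorFlow installed, the names could instead be collected with [d.name for d in device_lib.list_local_devices()].

```python
def find_device(device_names, wanted="GPU:0"):
    """Return the device names whose trailing 'TYPE:index' part equals `wanted`."""
    matches = []
    for name in device_names:
        # ".../device:XLA_GPU:0" -> "XLA_GPU:0"
        suffix = name.rsplit("device:", 1)[-1]
        if suffix == wanted:
            matches.append(name)
    return matches

# Devices listed as available in the error message above.
available = [
    "/job:localhost/replica:0/task:0/device:CPU:0",
    "/job:localhost/replica:0/task:0/device:XLA_CPU:0",
    "/job:localhost/replica:0/task:0/device:XLA_GPU:0",
]

gpu_matches = find_device(available, "GPU:0")
print("GPU:0 present:", bool(gpu_matches))
```

Note that XLA_GPU:0 is a different device type from GPU:0, which is why an exact match on the suffix is used rather than a substring search.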

CUDA_ERROR_OUT_OF_MEMORY during tf export mode despite turning gpu completely off already

I am currently running a few Tensorflow training jobs with gpu and am trying to export models from one such job. I have set
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
both in code and in terminal. Also I have removed all mentions of gpu devices in the training code, as well as moved graph.pbtxt away. I used inspect_checkpoint.py to see that the model checkpoint keys contain no mention of gpu either. I have also set
session_config = tf.ConfigProto(
    device_count={'GPU': 0 if export else config.num_gpus},
    allow_soft_placement=True,
    gpu_options=None if export else tf.GPUOptions(allow_growth=True))
Still I am getting the following error message towards the end of export:
2018-09-15 03:20:30.597742: E tensorflow/core/common_runtime/direct_session.cc:158] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 16936861696
nvidia-smi | head -20
Sat Sep 15 03:25:28 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:02:00.0 Off | 0 |
| N/A 35C P0 50W / 250W | 15800MiB / 16152MiB | 89% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:04:00.0 Off | 0 |
| N/A 35C P0 37W / 250W | 0MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-PCIE... Off | 00000000:83:00.0 Off | 0 |
| N/A 36C P0 37W / 250W | 0MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-PCIE... Off | 00000000:84:00.0 Off | 0 |
| N/A 38C P0 39W / 250W | 0MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

How to query NVIDIA GPU parameters for a certain PID?

I know with nvidia-smi an overview is generated like:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P4000 Off | 0000:01:00.0 Off | N/A |
| N/A 43C P0 26W / N/A | 227MiB / 8114MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1724 G /usr/bin/X 219MiB |
| 0 8074 G qtcreator 6MiB |
+-----------------------------------------------------------------------------+
However, I'd like to break the parameters down per process (e.g. GPU usage, used memory). I can't find a corresponding query, but then again I can't imagine that such a basic function is not implemented. Hence my question:
Is there an easy way to display the GPU parameters for each process?
I don't think you can get any closer than nvidia-smi pmon:
# gpu pid type sm mem enc dec fb command
# Idx # C/G % % % % MB name
0 1750 G 1 0 0 0 179 X
0 3734 G 0 0 0 0 7 qtcreator
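If the goal is to pick out one PID, the pmon text can be filtered in a script. The following is a minimal sketch that parses output in the layout shown above; the function names are made up for illustration, and the column order is assumed to match the header in the sample (it may differ between driver versions). Numeric columns are kept as strings because pmon can print "-" when a value is unavailable.

```python
def parse_pmon(text):
    """Parse pmon-style output into a list of dicts, skipping '#' header lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        gpu, pid, ptype, sm, mem, enc, dec, fb = fields[:8]
        rows.append({
            "gpu": int(gpu),
            "pid": int(pid),
            "type": ptype,
            "sm": sm,
            "mem": mem,
            "enc": enc,
            "dec": dec,
            "fb_mb": fb,
            "command": " ".join(fields[8:]),
        })
    return rows

def rows_for_pid(rows, pid):
    """Keep only the rows that belong to the given process ID."""
    return [row for row in rows if row["pid"] == pid]

# Sample copied from the output above.
sample = """# gpu pid type sm mem enc dec fb command
# Idx # C/G % % % % MB name
0 1750 G 1 0 0 0 179 X
0 3734 G 0 0 0 0 7 qtcreator"""

rows = parse_pmon(sample)
for row in rows_for_pid(rows, 3734):
    print(row["command"], "uses", row["fb_mb"], "MB of framebuffer memory")
```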

GPU load in tensorflow

I just built TensorFlow v1.0 and I am trying to run the MNIST test just to see if it's working. It seems to be, but I am observing weird behaviour.
My system has two Tesla P100, and nvidia-smi shows the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.107 Driver Version: 361.107 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... Off | 0002:01:00.0 Off | 0 |
| N/A 34C P0 114W / 300W | 15063MiB / 16280MiB | 51% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... Off | 0006:01:00.0 Off | 0 |
| N/A 27C P0 35W / 300W | 14941MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 67288 C python3 15061MiB |
| 1 67288 C python3 14939MiB |
+-----------------------------------------------------------------------------+
As shown, python3 took all the memory on both GPUs, but the computational load is placed only on the first one.
By exporting CUDA_VISIBLE_DEVICES I can limit which GPUs are used, but it does not affect the computation time, so there is no gain from adding the second GPU.
Single GPU:
real 2m23.496s
user 4m26.597s
sys 0m12.587s
Two GPUs:
real 2m18.165s
user 4m18.625s
sys 0m12.958s
So the question is: how do I put load on both GPUs?

Specify gpu in Tensorflow code: /gpu:0 is always working?

I have 3 graphics cards in my workstation: one of them is a Quadro K620, and the other two are Titan X. I would like to run my TensorFlow code on one of the graphics cards, so that I can leave the others idle for another task.
However, regardless of whether I set tf.device('/gpu:0') or tf.device('/gpu:1'), the 1st Titan X graphics card is always working, and I don't know why.
import argparse
import os
import time
import tensorflow as tf
import numpy as np
import cv2
from Dataset import Dataset
from Net import Net
FLAGS = None
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--foldername', type=str, default='./data-large/')
    parser.add_argument('--batch_size', type=int, default=100)
    parser.add_argument('--num_epoches', type=int, default=100)
    parser.add_argument('--learning_rate', type=float, default=0.5)
    FLAGS = parser.parse_args()
    net = Net(FLAGS.batch_size, FLAGS.learning_rate)
    with tf.Graph().as_default():
        # Dataset is a class for encapsulate the input pipeline
        dataset = Dataset(foldername=FLAGS.foldername,
                          batch_size=FLAGS.batch_size,
                          num_epoches=FLAGS.num_epoches)
        images, labels = dataset.samples_train
        ## The following code defines the network and train
        with tf.device('/gpu:0'): # <==== THIS LINE
            logits = net.inference(images)
            loss = net.loss(logits, labels)
            train_op = net.training(loss)
        init_op = tf.group(tf.initialize_all_variables(), tf.initialize_local_variables())
        sess = tf.Session()
        sess.run(init_op)
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        start_time = time.time()
        try:
            step = 0
            while not coord.should_stop():
                _, loss_value = sess.run([train_op, loss])
                step = step + 1
                if step % 100 == 0:
                    format_str = ('step %d, loss = %.2f, time: %.2f seconds')
                    print(format_str % (step, loss_value, (time.time() - start_time)))
                    start_time = time.time()
        except tf.errors.OutOfRangeError:
            print('done')
        finally:
            coord.request_stop()
        coord.join(threads)
        sess.close()
Regarding the line marked "<==== THIS LINE":
If I set tf.device('/gpu:0'), the monitor says:
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K620 Off | 0000:03:00.0 On | N/A |
| 34% 45C P0 2W / 30W | 404MiB / 1993MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:04:00.0 Off | N/A |
| 22% 39C P2 100W / 250W | 11691MiB / 12206MiB | 8% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:81:00.0 Off | N/A |
| 22% 43C P2 71W / 250W | 111MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
showing the 1st Titan X card is working.
If I set tf.device('/gpu:1'), the monitor says:
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K620 Off | 0000:03:00.0 On | N/A |
| 34% 45C P0 2W / 30W | 411MiB / 1993MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:04:00.0 Off | N/A |
| 22% 52C P2 73W / 250W | 11628MiB / 12206MiB | 12% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:81:00.0 Off | N/A |
| 22% 42C P2 71W / 250W | 11628MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
showing that the two Titan X cards are working, not the 2nd Titan X alone.
So what is the reason behind this, and how do I specify the GPU I want my program to run on?
Just a guess, but the default behavior for a tf.train.Optimizer object (which I expect is created in net.training(loss)) when you call minimize() is colocate_gradients_with_ops=False. This may lead to the backpropagation ops being placed on the default device, which will be /gpu:0.
To work out if this is happening, you can iterate over sess.graph_def and look for nodes that either have /gpu:0 in the NodeDef.device field, or have an empty device field (in which case they will be placed on /gpu:0 by default).
Another option for checking what devices are being used is to use the output_partition_graphs=True option when running your step. This shows what devices TensorFlow is actually using (instead of, in sess.graph_def, what devices your program is requesting), and should show exactly what nodes are running on /gpu:0.
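The first check can be sketched as follows. This is an illustration rather than a definitive recipe: the helper names are made up, and the function only assumes that each node has .name and .device attributes, as NodeDef entries do. In a real TF 1.x session you would pass sess.graph_def.node; the SimpleNamespace objects below are stand-ins so the sketch runs without TensorFlow.

```python
from collections import defaultdict
from types import SimpleNamespace

def nodes_by_device(nodes):
    """Group node names by their requested device ('' means no device requested)."""
    grouped = defaultdict(list)
    for node in nodes:
        grouped[node.device or "<unassigned>"].append(node.name)
    return dict(grouped)

def likely_on_gpu0(nodes):
    """Names of nodes that request GPU 0 or request no device at all."""
    suspects = []
    for node in nodes:
        device = node.device.lower()
        if not device or device.endswith(("/gpu:0", ":gpu:0")):
            suspects.append(node.name)
    return suspects

# Stand-in nodes with hypothetical names; replace with sess.graph_def.node.
fake_nodes = [
    SimpleNamespace(name="logits/MatMul", device="/device:GPU:1"),
    SimpleNamespace(name="gradients/logits/MatMul_grad", device=""),
    SimpleNamespace(name="init", device="/gpu:0"),
]

print(nodes_by_device(fake_nodes))
print(likely_on_gpu0(fake_nodes))
```

Keep in mind that this only inspects what the program requested; the output_partition_graphs approach described above is what shows the placement TensorFlow actually chose.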