Specify GPU in TensorFlow code: why is /gpu:0 always working? - tensorflow

I have 3 graphics cards in my workstation: one is a Quadro K620, and the other two are Titan Xs. I would like to run my TensorFlow code on one of the cards so that I can leave the others idle for another task.
However, regardless of whether I set tf.device('/gpu:0') or tf.device('/gpu:1'), I found that the 1st Titan X card is always the one doing the work, and I don't know why.
import argparse
import os
import time
import tensorflow as tf
import numpy as np
import cv2
from Dataset import Dataset
from Net import Net

FLAGS = None

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--foldername', type=str, default='./data-large/')
    parser.add_argument('--batch_size', type=int, default=100)
    parser.add_argument('--num_epoches', type=int, default=100)
    parser.add_argument('--learning_rate', type=float, default=0.5)
    FLAGS = parser.parse_args()

    net = Net(FLAGS.batch_size, FLAGS.learning_rate)

    with tf.Graph().as_default():
        # Dataset is a class that encapsulates the input pipeline
        dataset = Dataset(foldername=FLAGS.foldername,
                          batch_size=FLAGS.batch_size,
                          num_epoches=FLAGS.num_epoches)
        images, labels = dataset.samples_train

        ## The following code defines the network and the training op
        with tf.device('/gpu:0'):  # <==== THIS LINE
            logits = net.inference(images)
            loss = net.loss(logits, labels)
            train_op = net.training(loss)

        init_op = tf.group(tf.initialize_all_variables(),
                           tf.initialize_local_variables())

        sess = tf.Session()
        sess.run(init_op)

        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)

        start_time = time.time()
        try:
            step = 0
            while not coord.should_stop():
                _, loss_value = sess.run([train_op, loss])
                step = step + 1
                if step % 100 == 0:
                    format_str = 'step %d, loss = %.2f, time: %.2f seconds'
                    print(format_str % (step, loss_value, (time.time() - start_time)))
                    start_time = time.time()
        except tf.errors.OutOfRangeError:
            print('done')
        finally:
            coord.request_stop()

        coord.join(threads)
        sess.close()
Regarding the line marked "<==== THIS LINE":
If I set tf.device('/gpu:0'), the monitor says:
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K620 Off | 0000:03:00.0 On | N/A |
| 34% 45C P0 2W / 30W | 404MiB / 1993MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:04:00.0 Off | N/A |
| 22% 39C P2 100W / 250W | 11691MiB / 12206MiB | 8% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:81:00.0 Off | N/A |
| 22% 43C P2 71W / 250W | 111MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
showing the 1st Titan X card is working.
If I set tf.device('/gpu:1'), the monitor says:
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K620 Off | 0000:03:00.0 On | N/A |
| 34% 45C P0 2W / 30W | 411MiB / 1993MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:04:00.0 Off | N/A |
| 22% 52C P2 73W / 250W | 11628MiB / 12206MiB | 12% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:81:00.0 Off | N/A |
| 22% 42C P2 71W / 250W | 11628MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
showing that the two Titan X cards are working, not the 2nd Titan X alone.
So what is the reason behind this, and how can I specify the GPU I want my program to run on?

Just a guess, but the default behavior for a tf.train.Optimizer object (which I expect is created in net.training(loss)) when you call minimize() is colocate_gradients_with_ops=False. This may lead to the backpropagation ops being placed on the default device, which will be /gpu:0.
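If that is the case, passing colocate_gradients_with_ops=True to minimize() keeps each gradient op on the same device as the forward op it differentiates. A minimal sketch, assuming net.training() wraps a standard TF 1.x optimizer (its actual contents are not shown in the question):

    def training(self, loss):
        # Hypothetical version of net.training(); the important part is the flag.
        optimizer = tf.train.GradientDescentOptimizer(self.learning_rate)
        # Place each gradient op next to the op it differentiates, so the
        # backward pass stays on the device chosen by tf.device('/gpu:1').
        return optimizer.minimize(loss, colocate_gradients_with_ops=True)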
To work out if this is happening, you can iterate over sess.graph_def and look for nodes that either have /gpu:0 in the NodeDef.device field, or have an empty device field (in which case they will be placed on /gpu:0 by default).
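A quick sketch of that check (reusing sess from the question's code):

    # List nodes that were explicitly pinned to GPU 0 or left without a device;
    # nodes with an empty device string fall back to /gpu:0 by default.
    for node in sess.graph_def.node:
        if node.device == '' or 'gpu:0' in node.device.lower():
            print(node.name, repr(node.device))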
Another option for checking what devices are being used is to use the output_partition_graphs=True option when running your step. This shows what devices TensorFlow is actually using (instead of, in sess.graph_def, what devices your program is requesting), and should show exactly what nodes are running on /gpu:0.
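For example, something along these lines (a sketch using the TF 1.x RunOptions/RunMetadata API, reusing sess, train_op and loss from the question):

    # Request the per-device partition graphs for one training step and print
    # where each node actually ran.
    run_options = tf.RunOptions(output_partition_graphs=True)
    run_metadata = tf.RunMetadata()
    sess.run([train_op, loss], options=run_options, run_metadata=run_metadata)

    for partition in run_metadata.partition_graphs:  # one GraphDef per device
        for node in partition.node:
            print(node.device, node.name)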

Related

Allocating Large Tensor on multiple GPUs using Distributed Learning in Keras

I am using TensorFlow distributed learning with the following commands:
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2,3"
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
model = Basic_Model()
model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])
The system being used has four 32 GB GPUs. The following is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 37C P0 65W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 38C P0 40W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 33C P0 40W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 39C P0 41W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
But after running the script to create the model, I am getting the following error:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape [131072,65536] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:RandomUniform]
A tensor of shape [131072, 65536] of type float would take 131072 * 65536 * 4 bytes, i.e. about 34.36 GB. And there are four 32 GB GPUs, so why can't it be allocated?
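For reference, checking that arithmetic in plain Python (nvidia-smi reports MiB, hence the GiB/GB distinction):

    # float32 tensor of shape [131072, 65536]
    size_bytes = 131072 * 65536 * 4
    print(size_bytes / 2**30)  # 32.0 GiB
    print(size_bytes / 1e9)    # ~34.36 GB, more than the 32480 MiB a single card reports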
MirroredStrategy creates a copy of every variable in the scope on each GPU, so a tensor of ~34 GB is too large for any single card. You might want to try something like tf.distribute.experimental.CentralStorageStrategy instead. With MirroredStrategy, the usable GPU memory is not vram * num_of_gpus; in practice it is the smallest single card's vram, so in your case Keras is working with 32 GB of memory per replica, not 32 * 4 = 128 GB.
strategy = tf.distribute.experimental.CentralStorageStrategy()

dataset = ...  # some dataset
dataset = strategy.experimental_distribute_dataset(dataset)

with strategy.scope():
    model = Basic_Model()
    model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])
Example:
Tensor A is [0, 1, 2, 3] and you have four GPUs. MirroredStrategy will load:
GPU0: [0, 1, 2, 3]
GPU1: [0, 1, 2, 3]
GPU2: [0, 1, 2, 3]
GPU3: [0, 1, 2, 3]
NOT
GPU0: [0]
GPU1: [1]
GPU2: [2]
GPU3: [3]
As you can see, MirroredStrategy requires every available device to be able to hold all of the data, so you are limited by your smallest device when using this strategy.
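A minimal sketch of how to confirm this replication (assuming TF 2.x; the variable is illustrative):

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()
    print("Replicas in sync:", strategy.num_replicas_in_sync)  # e.g. 4

    with strategy.scope():
        # Every replica gets a full copy of this variable, not a shard of it,
        # so per-GPU memory is the binding limit.
        v = tf.Variable(tf.zeros([4]))

    print(type(v).__name__)  # a mirrored (per-replica) variable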

"Adding visible gpu devices: 0.." has been constantly outputted in nohup.out

I'm running a PSO (particle swarm optimization) program and I use TensorFlow to compute the mean squared error as the fitness, but every minute there is a block of output in nohup.out beginning with "Adding visible gpu devices: 0". I run my code on both GPU and CPU, but after 5 days they are progressing at almost the same rate. Why does the GPU run so slowly? And how can I stop the constant output?
I use device No. 2, and the GPU does seem to be in use.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:02:00.0 Off | N/A |
| 23% 39C P0 61W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:04:00.0 Off | N/A |
| 23% 40C P0 60W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp Off | 00000000:83:00.0 Off | N/A |
| 23% 39C P8 17W / 250W | 7261MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp Off | 00000000:84:00.0 Off | N/A |
| 23% 37C P0 58W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 26737 C python 7251MiB |
+-----------------------------------------------------------------------------+
nohup.out keeps printing this information every minute:
2019-11-08 11:49:18.032239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-11-08 11:49:18.032316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-08 11:49:18.032326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-11-08 11:49:18.032332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-11-08 11:49:18.032511: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6991 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:83:00.0, compute capability: 6.1)
2019-11-08 11:50:42.409142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-11-08 11:50:42.409214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-08 11:50:42.409223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-11-08 11:50:42.409229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-11-08 11:50:42.409407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6991 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:83:00.0, compute capability: 6.1)
The objective function is like this:
def function(self, M, w_h, w_o):
    def model(X, w_h, w_o):
        h = tf.matmul(X, w_h)
        return tf.matmul(h, w_o)

    X = tf.placeholder(tf.float64, [None, 4])
    Y = tf.placeholder(tf.float64, [None, 2])
    w_h = tf.Variable(w_h)
    w_o = tf.Variable(w_o)
    py_x = model(X, w_h, w_o)
    loss = tf.reduce_mean((py_x - Y) ** 2)

    with tf.Session() as sess:
        tf.initializers.global_variables().run()
        sum = 0
        length = len(trY)
        for start, end in zip(range(0, length, batchsize),
                              range(batchsize, length + 1, batchsize)):
            sum += sess.run(loss, feed_dict={X: trX[start:end], Y: trY[start:end]})
        if not length % batchsize:
            E = sum / (length / batchsize)
        else:
            E = sum / (1 + floor(length / batchsize))
        return E

How to measure (manually) how much of my GPU memory is used / available

I'm experimenting with TensorFlow by creating different models and testing them. The problem I'm having right now is that I don't have a clear sense of how big my model can get before I hit the OOM (out of memory) error. Sure, I can keep adding layers or making them bigger and see when I hit the limit, but it would be nice if I could measure how much memory my models are actually occupying.
Using the nvidia-smi command I can see this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40 Driver Version: 430.40 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 Off | N/A |
| 14% 56C P2 87W / 280W | 10767MiB / 11178MiB | 37% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 24329 C python 10747MiB |
+-----------------------------------------------------------------------------+
But the problem with the memory usage reported here is that no matter how small or big my model is, it always shows the same number, as if TensorFlow grabs all of it regardless.
I'm looking for a formula that gives me the required memory from the number of model parameters, the batch size, and the sample size.
You can use the following function:
def get_max_memory_usage(sess):
    """Might be imprecise. Run after training."""
    if sess is None:
        sess = tf.get_default_session()
    max_mem = int(sess.run(tf.contrib.memory_stats.MaxBytesInUse()))
    print("Contrib max memory: %f G" % (max_mem / 1024. / 1024. / 1024.))
    return max_mem
Or look into mem_utils by Vladislav Bulatov
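Note also that by default TensorFlow reserves nearly all GPU memory up front, which is why nvidia-smi shows the same number regardless of model size. If you want nvidia-smi to track actual usage more closely, one option is on-demand allocation; a minimal sketch using the TF 1.x session config:

    import tensorflow as tf

    # Allocate GPU memory as needed rather than grabbing almost all of it at
    # session creation, so nvidia-smi reflects what the graph really uses.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)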

TensorFlow with 4 GPUs doesn't speed up the training

My code first:
from sklearn.datasets.samples_generator import make_blobs
from matplotlib import pyplot
from numpy import where
from keras.utils import to_categorical
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD
from keras.utils import multi_gpu_model
X, y = make_blobs(n_samples=1000000, centers=3, n_features=3, cluster_std=2, random_state=2)
y = to_categorical(y)
n_train = 500000
trainX, testX = X[:n_train, :], X[n_train:, :]
trainY, testY = y[:n_train], y[n_train:]
model = Sequential()
model.add(Dense(50, input_dim=3, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(3, activation='softmax'))
p_model = multi_gpu_model(model, gpus=4)
opt = SGD(lr=0.01, momentum=0.9)
p_model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
history = p_model.fit(trainX, trainY, validation_data=(testX, testY), epochs=20, verbose=1, batch_size=32)
_, train_acc = p_model.evaluate(trainX, trainY, verbose=0)
_, test_acc = p_model.evaluate(testX, testY, verbose=0)
print("Train: %.3f, Test: %.3f" % (train_acc, test_acc))
The line below is the one that is supposed to use the 4 powerful GPUs:
p_model = multi_gpu_model(model, gpus=4)
While training is running, I can see the following (are the GPUs fully utilized?):
(tf_gpu) [martin@A08-R32-I196-3-FZ2LTP2 mlm]$ nvidia-smi
Wed Jan 23 09:08:24 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:02:00.0 Off | 0 |
| N/A 29C P0 49W / 250W | 21817MiB / 22919MiB | 9% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P40 Off | 00000000:04:00.0 Off | 0 |
| N/A 34C P0 50W / 250W | 21817MiB / 22919MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P40 Off | 00000000:83:00.0 Off | 0 |
| N/A 28C P0 48W / 250W | 21817MiB / 22919MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P40 Off | 00000000:84:00.0 Off | 0 |
| N/A 36C P0 51W / 250W | 21817MiB / 22919MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 114918 C python 21807MiB |
| 1 114918 C python 21807MiB |
| 2 114918 C python 21807MiB |
| 3 114918 C python 21807MiB |
+-----------------------------------------------------------------------------+
However, compared with a single-GPU run, or even with a run on my Mac desktop, it doesn't speed up at all. The total training takes about 20 minutes, almost the same as single-GPU training, and much slower than training on my personal Mac. Why is that?
Increase your batch_size beyond 32, and keep increasing it until you reach full GPU utilization. Yes, this might affect your model, but it does significantly increase performance. You will have to find a sweet spot for that batch_size.
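With multi_gpu_model the global batch is split across the GPUs, so batch_size=32 gives each card only 8 samples per step, far too little to keep a P40 busy. A sketch of scaling the batch with the number of GPUs (256 per card is just an illustrative starting point to tune):

    # Scale the global batch so each of the 4 replicas gets a sizeable sub-batch.
    n_gpus = 4
    batch_size_per_gpu = 256          # tune this value
    global_batch_size = batch_size_per_gpu * n_gpus

    history = p_model.fit(trainX, trainY,
                          validation_data=(testX, testY),
                          epochs=20, verbose=1,
                          batch_size=global_batch_size)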

GPU load in tensorflow

I just built TensorFlow v1.0 and I am trying to run the MNIST test just to see if it's working. It seems like it is, but I am observing weird behaviour.
My system has two Tesla P100s, and nvidia-smi shows the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.107 Driver Version: 361.107 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... Off | 0002:01:00.0 Off | 0 |
| N/A 34C P0 114W / 300W | 15063MiB / 16280MiB | 51% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... Off | 0006:01:00.0 Off | 0 |
| N/A 27C P0 35W / 300W | 14941MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 67288 C python3 15061MiB |
| 1 67288 C python3 14939MiB |
+-----------------------------------------------------------------------------+
As shown, python3 ate all the memory on both GPUs, but the computational load is placed only on the first one.
By exporting CUDA_VISIBLE_DEVICES I can limit which GPU is used, but it doesn't affect the computation time, so there is no gain from adding a second GPU. Single GPU:
real 2m23.496s
user 4m26.597s
sys 0m12.587s
Two GPUs:
real 2m18.165s
user 4m18.625s
sys 0m12.958s
So the question is: how do I put load on both GPUs?