TensorFlow with 4 GPUs doesn't speed up the training

My code first:
from sklearn.datasets.samples_generator import make_blobs
from matplotlib import pyplot
from numpy import where
from keras.utils import to_categorical
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD
from keras.utils import multi_gpu_model
X, y = make_blobs(n_samples=1000000, centers=3, n_features=3, cluster_std=2, random_state=2)
y = to_categorical(y)
n_train = 500000
trainX, testX = X[:n_train, :], X[n_train:, :]
trainY, testY = y[:n_train], y[n_train:]
model = Sequential()
model.add(Dense(50, input_dim=3, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(3, activation='softmax'))
p_model = multi_gpu_model(model, gpus=4)
opt = SGD(lr=0.01, momentum=0.9)
p_model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
history = p_model.fit(trainX, trainY, validation_data=(testX, testY), epochs=20, verbose=1, batch_size=32)
_, train_acc = p_model.evaluate(trainX, trainY, verbose=0)
_, test_acc = p_model.evaluate(testX, testY, verbose=0)
print("Train: %.3f, Test: %.3f" % (train_acc, test_acc))
The line below is what enables the 4 powerful GPUs:
p_model = multi_gpu_model(model, gpus=4)
While training is running, I can see the following (so the GPUs are fully utilized?):
(tf_gpu) [martin#A08-R32-I196-3-FZ2LTP2 mlm]$ nvidia-smi
Wed Jan 23 09:08:24 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:02:00.0 Off | 0 |
| N/A 29C P0 49W / 250W | 21817MiB / 22919MiB | 9% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P40 Off | 00000000:04:00.0 Off | 0 |
| N/A 34C P0 50W / 250W | 21817MiB / 22919MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P40 Off | 00000000:83:00.0 Off | 0 |
| N/A 28C P0 48W / 250W | 21817MiB / 22919MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P40 Off | 00000000:84:00.0 Off | 0 |
| N/A 36C P0 51W / 250W | 21817MiB / 22919MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 114918 C python 21807MiB |
| 1 114918 C python 21807MiB |
| 2 114918 C python 21807MiB |
| 3 114918 C python 21807MiB |
+-----------------------------------------------------------------------------+
However, compared to a single-GPU run, or even to running on my Mac desktop, it doesn't speed up at all. The total training takes about 20 minutes, almost the same as single-GPU training, and much slower than training on my personal Mac. Why is that?

Increase your batch_size beyond 32. Keep increasing it until you reach full GPU utilization. Yes, this might affect your model, but it significantly increases throughput. You will have to find a sweet spot for that batch_size.
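For instance, a minimal sketch of that change against the code above (the value 1024 is only an assumed starting point, not a recommendation):

# multi_gpu_model splits each batch across the 4 replicas, so scale the
# global batch size with the GPU count to give each device enough work.
n_gpus = 4
batch_size_per_gpu = 1024          # assumed starting point; tune for your model
global_batch_size = batch_size_per_gpu * n_gpus

history = p_model.fit(trainX, trainY,
                      validation_data=(testX, testY),
                      epochs=20,
                      verbose=1,
                      batch_size=global_batch_size)

With batch_size=32, each GPU only sees 8 samples per step of a tiny two-layer network, so the per-step overhead (kernel launches and host-to-device copies) can easily dominate, which would explain why the 4-GPU run is no faster than a single GPU or a CPU.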

Related

Allocating Large Tensor on multiple GPUs using Distributed Learning in Keras

I am using TensorFlow distributed learning with the following commands:
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2,3"
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
model = Basic_Model()
model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])
The system being used has four 32 GB GPUs. The following is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 37C P0 65W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 38C P0 40W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 33C P0 40W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 39C P0 41W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
But after running the script to create the model, I am getting the following error:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape [131072,65536] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:RandomUniform]
A tensor of shape [131072, 65536] of type float would allocate 131072 * 65536 * 4 bytes, i.e. about 34.35 GB. And there are four 32 GB GPUs, so why is it not allocated?
MirroredStrategy creates a copy of every variable within its scope on each GPU, so a 34.35 GB tensor is too large for any single device. You might want to try something like tf.distribute.experimental.CentralStorageStrategy instead. In terms of GPU memory, MirroredStrategy does not give you vram * num_of_gpus; in practice you are limited to smallest_vram, so in your case Keras is working with 32 GB of memory per replica, not 32 * 4 = 128 GB.
strategy = tf.distribute.experimental.CentralStorageStrategy()
dataset = ...  # some dataset
dataset = strategy.experimental_distribute_dataset(dataset)
with strategy.scope():
    model = Basic_Model()
    model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])
Example:
Tensor A is [0, 1, 2, 3] and you have four GPUs. MirroredStrategy will load:
GPU0: [0, 1, 2, 3]
GPU1: [0, 1, 2, 3]
GPU2: [0, 1, 2, 3]
GPU3: [0, 1, 2, 3]
NOT
GPU0: [0]
GPU1: [1]
GPU2: [2]
GPU3: [3]
As you can see, MirroredStrategy requires all your available devices to be able to hold all of the data, therefore, you're limited to your smallest device when using this strategy.
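To make the memory arithmetic above concrete, a quick sanity check (just a sketch; the shape and dtype come from the error message):

# One float32 copy of the [131072, 65536] tensor from the OOM message.
rows, cols, bytes_per_float = 131072, 65536, 4
size_gb = rows * cols * bytes_per_float / 1e9
print("one copy: %.2f GB" % size_gb)                 # ~34.36 GB

# MirroredStrategy keeps a full copy of every variable on each GPU, so the
# requirement per device is the full ~34 GB, which exceeds 32 GB of VRAM.
print("fits on a single 32 GB GPU:", size_gb <= 32)  # False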

"Adding visible gpu devices: 0.." has been constantly outputted in nohup.out

I'm running a PSO (particle swarm optimization) program and I use TensorFlow to calculate the mean squared error as the fitness, but every minute there is an output in nohup.out beginning with "Adding visible gpu devices: 0". I run my code on both GPU and CPU, but 5 days later they have progressed at almost the same rate. Why does the GPU run so slowly? How can I stop the constant output?
I use the No. 2 device, and the GPU seems to be in use.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:02:00.0 Off | N/A |
| 23% 39C P0 61W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:04:00.0 Off | N/A |
| 23% 40C P0 60W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp Off | 00000000:83:00.0 Off | N/A |
| 23% 39C P8 17W / 250W | 7261MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp Off | 00000000:84:00.0 Off | N/A |
| 23% 37C P0 58W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 26737 C python 7251MiB |
+-----------------------------------------------------------------------------+
nohup.out prints this information every minute:
2019-11-08 11:49:18.032239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-11-08 11:49:18.032316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-08 11:49:18.032326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-11-08 11:49:18.032332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-11-08 11:49:18.032511: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6991 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:83:00.0, compute capability: 6.1)
2019-11-08 11:50:42.409142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-11-08 11:50:42.409214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-08 11:50:42.409223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-11-08 11:50:42.409229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-11-08 11:50:42.409407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6991 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:83:00.0, compute capability: 6.1)
The objective function is like this:
def function(self, M, w_h, w_o):
    # trX, trY, batchsize and floor (from math) are defined elsewhere in the module
    def model(X, w_h, w_o):
        h = tf.matmul(X, w_h)
        return tf.matmul(h, w_o)

    X = tf.placeholder(tf.float64, [None, 4])
    Y = tf.placeholder(tf.float64, [None, 2])
    w_h = tf.Variable(w_h)
    w_o = tf.Variable(w_o)
    py_x = model(X, w_h, w_o)
    loss = tf.reduce_mean((py_x - Y) ** 2)

    with tf.Session() as sess:
        tf.initializers.global_variables().run()
        sum = 0
        length = len(trY)
        for start, end in zip(range(0, length, batchsize),
                              range(batchsize, length + 1, batchsize)):
            sum += sess.run(loss, feed_dict={X: trX[start:end], Y: trY[start:end]})
        if not length % batchsize:
            E = sum / (length / batchsize)
        else:
            E = sum / (1 + floor(length / batchsize))
    return E
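For what it's worth (this is an assumption, not part of the original post): those log lines are printed each time a new tf.Session initializes the GPU, and the function above builds a fresh graph and session on every fitness evaluation, which is also very slow. A sketch of building the graph once and reusing a single session, feeding each particle's weights through placeholders (the shapes here are hypothetical):

import tensorflow as tf

X = tf.placeholder(tf.float64, [None, 4])
Y = tf.placeholder(tf.float64, [None, 2])
w_h_ph = tf.placeholder(tf.float64, [4, None])   # hypothetical hidden size
w_o_ph = tf.placeholder(tf.float64, [None, 2])
py_x = tf.matmul(tf.matmul(X, w_h_ph), w_o_ph)
loss = tf.reduce_mean((py_x - Y) ** 2)

sess = tf.Session()                              # created once, reused for every particle

def fitness(w_h, w_o, trX, trY, batchsize):
    total, n_batches = 0.0, 0
    for start in range(0, len(trY), batchsize):
        batch = {X: trX[start:start + batchsize], Y: trY[start:start + batchsize],
                 w_h_ph: w_h, w_o_ph: w_o}
        total += sess.run(loss, feed_dict=batch)
        n_batches += 1
    return total / n_batches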

CUDA_ERROR_OUT_OF_MEMORY during TF export despite having already turned the GPU completely off

I am currently running a few TensorFlow training jobs with GPUs and am trying to export models from one such job. I have set
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
both in code and in the terminal. I have also removed all mentions of GPU devices from the training code and moved graph.pbtxt away. I used inspect_checkpoint.py to check that the model checkpoint keys contain no mention of GPUs either. I have also set
session_config = tf.ConfigProto(
    device_count={'GPU': 0 if export else config.num_gpus},
    allow_soft_placement=True,
    gpu_options=None if export else tf.GPUOptions(allow_growth=True))
Still I am getting the following error message towards the end of export:
2018-09-15 03:20:30.597742: E tensorflow/core/common_runtime/direct_session.cc:158] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 16936861696
nvidia-smi | head -20
Sat Sep 15 03:25:28 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:02:00.0 Off | 0 |
| N/A 35C P0 50W / 250W | 15800MiB / 16152MiB | 89% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:04:00.0 Off | 0 |
| N/A 35C P0 37W / 250W | 0MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-PCIE... Off | 00000000:83:00.0 Off | 0 |
| N/A 36C P0 37W / 250W | 0MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-PCIE... Off | 00000000:84:00.0 Off | 0 |
| N/A 38C P0 39W / 250W | 0MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
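One thing worth double-checking (an assumption on my part, not something stated in the post): CUDA_VISIBLE_DEVICES only takes effect if it is set before TensorFlow initializes the CUDA driver in that process, so the export entry point would need something like:

import os
# Hide all GPUs before TensorFlow touches CUDA; setting this after a session
# has already been created in the same process does not help.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import tensorflow as tf  # imported only after the environment variable is set

session_config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=session_config) as sess:
    pass  # restore the checkpoint and run the export here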

GPU load in TensorFlow

I just built TensorFlow v1.0 and I am trying to run the MNIST test just to see if it works. It seems like it does, but I am observing weird behaviour.
My system has two Tesla P100s, and nvidia-smi shows the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.107 Driver Version: 361.107 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... Off | 0002:01:00.0 Off | 0 |
| N/A 34C P0 114W / 300W | 15063MiB / 16280MiB | 51% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... Off | 0006:01:00.0 Off | 0 |
| N/A 27C P0 35W / 300W | 14941MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 67288 C python3 15061MiB |
| 1 67288 C python3 14939MiB |
+-----------------------------------------------------------------------------+
As shown, python3 ate all the memory on both GPUs, but the computational load is placed only on the first one.
By exporting CUDA_VISIBLE_DEVICES I can limit which GPU is used, but it does not affect the computation time, so there is no gain from adding the second GPU. Single GPU:
real 2m23.496s
user 4m26.597s
sys 0m12.587s
Two GPUs:
real 2m18.165s
user 4m18.625s
sys 0m12.958s
So the question is: how do I load both GPUs?
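For reference (this is a sketch of the standard TF 1.x multi-tower pattern, not something from the original post), loading both GPUs requires building one tower per device explicitly and sharing the variables between them:

import tensorflow as tf

def tower_loss(x, y):
    # Hypothetical single-layer softmax model; substitute the real MNIST model.
    w = tf.get_variable('w', [784, 10], tf.float32)
    b = tf.get_variable('b', [10], tf.float32, initializer=tf.zeros_initializer())
    logits = tf.matmul(x, w) + b
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))

inputs = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.int64, [None])

# Split each batch in half and build one tower per GPU; reuse the variables
# created by the first tower so both towers train the same weights.
x_halves = tf.split(inputs, 2)
y_halves = tf.split(labels, 2)
tower_losses = []
for i in range(2):
    with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=(i > 0)):
        tower_losses.append(tower_loss(x_halves[i], y_halves[i]))

total_loss = tf.add_n(tower_losses) / 2.0
# Keep each tower's backprop ops on the same GPU as its forward ops.
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(
    total_loss, colocate_gradients_with_ops=True)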

Specifying a GPU in TensorFlow code: why is /gpu:0 always the one working?

I have 3 graphics cards in my workstation; one of them is a Quadro K620, and the other two are Titan Xs. I would like to run my TensorFlow code on one of the cards so that I can leave the others idle for another task.
However, regardless of whether I set tf.device('/gpu:0') or tf.device('/gpu:1'), I found that the first Titan X card is always working, and I don't know why.
import argparse
import os
import time
import tensorflow as tf
import numpy as np
import cv2
from Dataset import Dataset
from Net import Net

FLAGS = None

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--foldername', type=str, default='./data-large/')
    parser.add_argument('--batch_size', type=int, default=100)
    parser.add_argument('--num_epoches', type=int, default=100)
    parser.add_argument('--learning_rate', type=float, default=0.5)
    FLAGS = parser.parse_args()

    net = Net(FLAGS.batch_size, FLAGS.learning_rate)
    with tf.Graph().as_default():
        # Dataset is a class that encapsulates the input pipeline
        dataset = Dataset(foldername=FLAGS.foldername,
                          batch_size=FLAGS.batch_size,
                          num_epoches=FLAGS.num_epoches)
        images, labels = dataset.samples_train

        ## The following code defines the network and trains it
        with tf.device('/gpu:0'):  # <==== THIS LINE
            logits = net.inference(images)
            loss = net.loss(logits, labels)
            train_op = net.training(loss)

        init_op = tf.group(tf.initialize_all_variables(), tf.initialize_local_variables())
        sess = tf.Session()
        sess.run(init_op)

        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)

        start_time = time.time()
        try:
            step = 0
            while not coord.should_stop():
                _, loss_value = sess.run([train_op, loss])
                step = step + 1
                if step % 100 == 0:
                    format_str = ('step %d, loss = %.2f, time: %.2f seconds')
                    print(format_str % (step, loss_value, (time.time() - start_time)))
                    start_time = time.time()
        except tf.errors.OutOfRangeError:
            print('done')
        finally:
            coord.request_stop()

        coord.join(threads)
        sess.close()
Regarding the line marked "<==== THIS LINE":
If I set tf.device('/gpu:0'), the monitor says:
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K620 Off | 0000:03:00.0 On | N/A |
| 34% 45C P0 2W / 30W | 404MiB / 1993MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:04:00.0 Off | N/A |
| 22% 39C P2 100W / 250W | 11691MiB / 12206MiB | 8% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:81:00.0 Off | N/A |
| 22% 43C P2 71W / 250W | 111MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
showing that the first Titan X card is working.
If I set tf.device('/gpu:1'), the monitor says:
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K620 Off | 0000:03:00.0 On | N/A |
| 34% 45C P0 2W / 30W | 411MiB / 1993MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:04:00.0 Off | N/A |
| 22% 52C P2 73W / 250W | 11628MiB / 12206MiB | 12% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:81:00.0 Off | N/A |
| 22% 42C P2 71W / 250W | 11628MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
showing that both Titan X cards are working, not the second Titan X alone.
So what is the reason behind this, and how can I specify the GPU I want my program to run on?
Just a guess, but the default behavior for a tf.train.Optimizer object (which I expect is created in net.training(loss)) when you call minimize() is colocate_gradients_with_ops=False. This may lead to the backpropagation ops being placed on the default device, which will be /gpu:0.
To work out if this is happening, you can iterate over sess.graph_def and look for nodes that either have /gpu:0 in the NodeDef.device field, or have an empty device field (in which case they will be placed on /gpu:0 by default).
Another option for checking what devices are being used is to use the output_partition_graphs=True option when running your step. This shows what devices TensorFlow is actually using (instead of, in sess.graph_def, what devices your program is requesting), and should show exactly what nodes are running on /gpu:0.
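Putting both checks into code (a sketch against the sess, train_op and loss from the question; everything else is standard TF 1.x API):

# (1) Inspect the devices requested in the graph definition: nodes whose
#     device field is '/gpu:0' or empty will end up on /gpu:0 by default.
for node in sess.graph_def.node:
    if node.device in ('', '/gpu:0', '/device:GPU:0'):
        print(node.name, '->', node.device or '(unset, defaults to /gpu:0)')

# (2) Ask TensorFlow which devices were actually used for one training step.
run_options = tf.RunOptions(output_partition_graphs=True)
run_metadata = tf.RunMetadata()
sess.run([train_op, loss], options=run_options, run_metadata=run_metadata)
for partition in run_metadata.partition_graphs:
    print('partition placed on:', {node.device for node in partition.node})

If the gradient nodes do show up on /gpu:0, passing colocate_gradients_with_ops=True to the minimize() call inside net.training() should keep each backpropagation op on the same device as the forward op it belongs to.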