I'm running a PSO program and I use TensorFlow to compute the mean squared error as the fitness value, but every minute nohup.out gets a new block of output beginning with "Adding visible gpu devices: 0". I ran the code on both GPU and CPU, and after 5 days they are progressing at almost the same rate. Why is the GPU so slow, and how can I stop the constant output?
I use device No. 2, and the GPU appears to be in use.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:02:00.0 Off | N/A |
| 23% 39C P0 61W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:04:00.0 Off | N/A |
| 23% 40C P0 60W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp Off | 00000000:83:00.0 Off | N/A |
| 23% 39C P8 17W / 250W | 7261MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp Off | 00000000:84:00.0 Off | N/A |
| 23% 37C P0 58W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 26737 C python 7251MiB |
+-----------------------------------------------------------------------------+
nohup.out is printing this information every minute:
2019-11-08 11:49:18.032239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-11-08 11:49:18.032316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-08 11:49:18.032326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-11-08 11:49:18.032332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-11-08 11:49:18.032511: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6991 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:83:00.0, compute capability: 6.1)
2019-11-08 11:50:42.409142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-11-08 11:50:42.409214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-08 11:50:42.409223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-11-08 11:50:42.409229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-11-08 11:50:42.409407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6991 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:83:00.0, compute capability: 6.1)
The objective (fitness) function looks like this:
def function(self, M, w_h, w_o):
    def model(X, w_h, w_o):
        h = tf.matmul(X, w_h)
        return tf.matmul(h, w_o)

    X = tf.placeholder(tf.float64, [None, 4])
    Y = tf.placeholder(tf.float64, [None, 2])
    w_h = tf.Variable(w_h)
    w_o = tf.Variable(w_o)
    py_x = model(X, w_h, w_o)
    loss = tf.reduce_mean((py_x - Y) ** 2)
    with tf.Session() as sess:
        tf.initializers.global_variables().run()
        sum = 0
        length = len(trY)
        for start, end in zip(range(0, length, batchsize), range(batchsize, length + 1, batchsize)):
            sum += sess.run(loss, feed_dict={X: trX[start:end], Y: trY[start:end]})
        if not length % batchsize:
            E = sum / (length / batchsize)
        else:
            E = sum / (1 + floor(length / batchsize))
        return E
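The repeated "Adding visible gpu devices" lines appear because this function builds a fresh graph and opens a new tf.Session on every fitness evaluation, so TensorFlow re-initializes the GPU each time; that per-call setup also likely eats most of the potential GPU speedup for such a tiny model. Below is a minimal sketch of one way to build the graph once and reuse the session across PSO iterations; the class name FitnessEvaluator, the placeholder-fed weights, and the fixed layer size argument are illustrative assumptions, not the original code:

import tensorflow as tf

class FitnessEvaluator(object):
    """Hypothetical helper: build the graph and the session once, reuse them per particle."""

    def __init__(self, n_hidden):
        self.X = tf.placeholder(tf.float64, [None, 4])
        self.Y = tf.placeholder(tf.float64, [None, 2])
        # Feed candidate weights instead of creating new tf.Variable objects on every call.
        self.w_h = tf.placeholder(tf.float64, [4, n_hidden])
        self.w_o = tf.placeholder(tf.float64, [n_hidden, 2])
        h = tf.matmul(self.X, self.w_h)
        py_x = tf.matmul(h, self.w_o)
        self.loss = tf.reduce_mean((py_x - self.Y) ** 2)
        self.sess = tf.Session()  # created once, so the device log lines appear only once

    def evaluate(self, trX, trY, w_h, w_o):
        return self.sess.run(self.loss,
                             feed_dict={self.X: trX, self.Y: trY,
                                        self.w_h: w_h, self.w_o: w_o})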
Related
I am using TensorFlow distributed training with the following code:
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2,3"
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = Basic_Model()
    model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])
The system being used has four 32 GB GPUs. The following is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 37C P0 65W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 38C P0 40W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 33C P0 40W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 39C P0 41W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
But after running the script to create the model, I am getting the following error -
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape [131072,65536] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:RandomUniform]
A tensor of shape [131072, 65536] of type float would require 131072 * 65536 * 4 bytes, i.e. about 34.35 GB. And there are four 32 GB GPUs, so why is it not allocated?
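As a quick sanity check of that arithmetic (a minimal sketch; float32 is 4 bytes per element):

elements = 131072 * 65536
size_bytes = elements * 4
print(size_bytes / 1e9)    # ~34.36 GB in decimal units
print(size_bytes / 2**30)  # exactly 32.0 GiB, more than the ~31.7 GiB (32480 MiB) a "32 GB" V100 exposes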
MirroredStrategy creates a copy of all variables within the scope on each GPU, so a 34.35 GB tensor is too large for any single device. You might want to try something like tf.distribute.experimental.CentralStorageStrategy instead. In terms of GPU memory, MirroredStrategy does not give you vram * num_of_gpus; it is effectively limited to smallest_vram, so in your case Keras is working with 32 GB of memory per replica, not 32 * 4 = 128 GB.
strategy = tf.distribute.experimental.CentralStorageStrategy()
dataset = # some dataset
dataset = strategy.experimental_distribute_dataset(dataset)
with strategy.scope():
    model = Basic_Model()
    model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])
Example:
Tensor A is [0, 1, 2, 3] and you have four GPUs. MirroredStrategy will load:
GPU0: [0, 1, 2, 3]
GPU1: [0, 1, 2, 3]
GPU2: [0, 1, 2, 3]
GPU3: [0, 1, 2, 3]
NOT
GPU0: [0]
GPU1: [1]
GPU2: [2]
GPU3: [3]
As you can see, MirroredStrategy requires every available device to be able to hold all of the data; you are therefore limited by your smallest device when using this strategy.
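A small sketch (TF 2.x; the variable below is illustrative, and it assumes strategy.experimental_local_results behaves as documented) that makes the replication visible:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    v = tf.Variable(tf.zeros([1024, 1024]))  # one full copy of this ends up on every GPU

# One component variable per device, each holding the complete tensor.
for component in strategy.experimental_local_results(v):
    print(component.device, component.shape)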
When I run a TensorFlow image training job in the tensorflow/tensorflow:latest-gpu container, it doesn't work.
Error message:
Cannot assign a device for operation InceptionV3/InceptionV3/Conv2d_1a_3x3/Conv2D: Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0 ]. Make sure the device specification refers to a valid device.
[[node InceptionV3/InceptionV3/Conv2d_1a_3x3/Conv2D (defined at /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/layers.py:1057) = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/device:GPU:0"](fifo_queue_Dequeue, InceptionV3/Conv2d_1a_3x3/weights/read)]]
GPU info:
nvidia-smi
Mon Nov 26 07:48:59 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 630 Off | 00000000:01:00.0 N/A | N/A |
| 25% 47C P0 N/A / N/A | 0MiB / 1998MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
It seems that TensorFlow is not detecting any GPU as available, yet the operations are explicitly mapped to GPU:0. First try this:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
And you'll get the available devices. Is /device:GPU:0 among them?
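If /device:GPU:0 is not in that list, one hedged stopgap (TF 1.x) is to allow soft placement so ops pinned to /device:GPU:0 fall back to the CPU instead of raising the "Cannot assign a device" error; this only hides the placement failure while you fix the container's GPU runtime, it does not make the GPU appear:

import tensorflow as tf

config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    # run the training graph here; ops explicitly assigned to GPU:0 will be
    # placed on the CPU instead of failing outright
    ...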
I would like to run tensorflow on a windows 10 server with 5 NVIDIA Quadro P6000 GPUs. After installing CUDA 8.0 and running deviceQuery.exe it only shows one of the GPUs and therefore tensorflow only uses one GPU as well.
(tensorflow-gpu) C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\extras\demo_suite>deviceQuery.exe
deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Quadro P6000"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 24576 MBytes (25769803776 bytes)
(30) Multiprocessors, (128) CUDA Cores/MP: 3840 CUDA Cores
GPU Max Clock rate: 1645 MHz (1.64 GHz)
Memory Clock rate: 4513 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Quadro P6000
Result = PASS
However, nvidia-smi.exe and GPU-Z do recognize all 5 of them.
(tensorflow-gpu) C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe
Wed Sep 27 11:07:44 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 376.51 Driver Version: 376.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P6000 WDDM | 0000:04:00.0 Off | Off |
| 26% 45C P5 19W / 250W | 353MiB / 24576MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
| 1 Quadro P6000 WDDM | 0000:05:00.0 Off | Off |
| 26% 24C P8 7W / 250W | 69MiB / 24576MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Quadro P6000 WDDM | 0000:08:00.0 Off | Off |
| 26% 26C P8 8W / 250W | 69MiB / 24576MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Quadro P6000 WDDM | 0000:09:00.0 Off | Off |
| 26% 26C P8 8W / 250W | 69MiB / 24576MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Quadro P6000 WDDM | 0000:89:00.0 Off | Off |
| 26% 22C P8 8W / 250W | 262MiB / 24576MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2640 C+G Insufficient Permissions N/A |
| 0 3908 C+G ...x86)\Google\Chrome\Application\chrome.exe N/A |
| 4 2640 C+G Insufficient Permissions N/A |
| 4 3552 C+G ...\ImmersiveControlPanel\SystemSettings.exe N/A |
| 4 5260 C+G Insufficient Permissions N/A |
| 4 6552 C+G Insufficient Permissions N/A |
| 4 7208 C+G ...ost_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 4 7444 C+G ...indows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
+-----------------------------------------------------------------------------+
Does anyone have an idea what I could try to make all of them work for CUDA and tensorflow?
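One hedged diagnostic (not a confirmed fix): check whether the CUDA runtime is being restricted by an environment variable, and compare with what TensorFlow itself reports, since CUDA_VISIBLE_DEVICES limits both deviceQuery and TensorFlow:

import os
from tensorflow.python.client import device_lib

# If a previous shell, scheduler, or IDE set this, only the listed GPUs are visible.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))

# List every device TensorFlow can see in this environment.
print(device_lib.list_local_devices())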
Running on Ubuntu 16.04, latest (1.1.0) tensorflow (installed via pip3 install tensorflow-gpu), CUDA8 + CUDNN5.
The code looks more or less like this:
import tensorflow as tf
from tensorflow.contrib.learn import KMeansClustering
trainencflt = #pandas frame with ~30k rows and ~300 columns
def train_input_fn():
    return (tf.constant(trainencflt, shape=[trainencflt.shape[0], trainencflt.shape[1]]), None)

configuration = tf.contrib.learn.RunConfig(log_device_placement=True)
model = KMeansClustering(num_clusters=k,
                         initial_clusters=KMeansClustering.RANDOM_INIT,
                         relative_tolerance=1e-8,
                         config=configuration)
model.fit(input_fn=train_input_fn, steps=100)
When it runs I see:
2017-06-15 10:24:41.564890: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:81:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
2017-06-15 10:24:41.564934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-06-15 10:24:41.564942: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-06-15 10:24:41.564956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:81:00.0)
Memory gets allocated:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 1 548 C python 7745MiB |
+-----------------------------------------------------------------------------+
But then none of the operations are performed on the GPU (GPU utilization stays at 0% the whole time, while CPU utilization skyrockets on all cores):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 1 GeForce GTX 1080 Off | 0000:02:00.0 Off | N/A |
| 29% 43C P8 13W / 180W | 7747MiB / 8114MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Not seeing any placement logs (even though I specified log_device_placement to be True).
I did try the simple GPU examples and they were working just fine (at least the placement logs were looking fine).
Am I missing something?
I went through the codebase: TF 1.1.0 simply didn't have a GPU kernel for KMeansClustering.
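A short sanity check (a TF 1.x sketch, not part of the original answer) that separates "the GPU setup is broken" from "this particular op has no GPU kernel": force a plain matmul onto the GPU with placement logging; if this lands on /gpu:0 while KMeansClustering stays on the CPU, the limitation is in the op's kernels, not in CUDA:

import tensorflow as tf

with tf.device('/gpu:0'):
    a = tf.random_normal([1000, 1000])
    b = tf.random_normal([1000, 1000])
    c = tf.matmul(a, b)

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(c)  # the placement log should show MatMul on /gpu:0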
I have 3 graphics cards in my workstation; one of them is a Quadro K620, and the other two are Titan Xs. I would now like to run my tensorflow code on one of the cards, so that I can leave the others idle for another task.
However, regardless of whether I set tf.device('/gpu:0') or tf.device('/gpu:1'), I found that the 1st Titan X card is always working, and I don't know why.
import argparse
import os
import time
import tensorflow as tf
import numpy as np
import cv2
from Dataset import Dataset
from Net import Net

FLAGS = None

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--foldername', type=str, default='./data-large/')
    parser.add_argument('--batch_size', type=int, default=100)
    parser.add_argument('--num_epoches', type=int, default=100)
    parser.add_argument('--learning_rate', type=float, default=0.5)
    FLAGS = parser.parse_args()
    net = Net(FLAGS.batch_size, FLAGS.learning_rate)

    with tf.Graph().as_default():
        # Dataset is a class for encapsulating the input pipeline
        dataset = Dataset(foldername=FLAGS.foldername,
                          batch_size=FLAGS.batch_size,
                          num_epoches=FLAGS.num_epoches)
        images, labels = dataset.samples_train

        ## The following code defines the network and the training loop
        with tf.device('/gpu:0'):  # <==== THIS LINE
            logits = net.inference(images)
            loss = net.loss(logits, labels)
            train_op = net.training(loss)

        init_op = tf.group(tf.initialize_all_variables(), tf.initialize_local_variables())
        sess = tf.Session()
        sess.run(init_op)

        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)

        start_time = time.time()
        try:
            step = 0
            while not coord.should_stop():
                _, loss_value = sess.run([train_op, loss])
                step = step + 1
                if step % 100 == 0:
                    format_str = ('step %d, loss = %.2f, time: %.2f seconds')
                    print(format_str % (step, loss_value, (time.time() - start_time)))
                    start_time = time.time()
        except tf.errors.OutOfRangeError:
            print('done')
        finally:
            coord.request_stop()

        coord.join(threads)
        sess.close()
Regarding the line marked "<==== THIS LINE":
If I set tf.device('/gpu:0'), the monitor says:
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K620 Off | 0000:03:00.0 On | N/A |
| 34% 45C P0 2W / 30W | 404MiB / 1993MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:04:00.0 Off | N/A |
| 22% 39C P2 100W / 250W | 11691MiB / 12206MiB | 8% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:81:00.0 Off | N/A |
| 22% 43C P2 71W / 250W | 111MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
showing the 1st Titan X card is working.
If I set tf.device('/gpu:1'), the monitor says:
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K620 Off | 0000:03:00.0 On | N/A |
| 34% 45C P0 2W / 30W | 411MiB / 1993MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:04:00.0 Off | N/A |
| 22% 52C P2 73W / 250W | 11628MiB / 12206MiB | 12% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:81:00.0 Off | N/A |
| 22% 42C P2 71W / 250W | 11628MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
showing that both Titan X cards are working, not the 2nd Titan X alone.
So what is the reason behind this, and how can I specify which GPU I want my program to run on?
Just a guess, but the default behavior for a tf.train.Optimizer object (which I expect is created in net.training(loss)) when you call minimize() is colocate_gradients_with_ops=False. This may lead to the backpropagation ops being placed on the default device, which will be /gpu:0.
To work out if this is happening, you can iterate over sess.graph_def and look for nodes that either have /gpu:0 in the NodeDef.device field, or have an empty device field (in which case they will be placed on /gpu:0 by default).
Another option for checking which devices are being used is to pass output_partition_graphs=True when running your step. This shows which devices TensorFlow is actually using (whereas sess.graph_def shows which devices your program is requesting), and should show exactly which nodes are running on /gpu:0.
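A rough sketch of both checks (TF 1.x), reusing the sess, train_op and loss names from the question; treat it as illustrative rather than a drop-in:

import tensorflow as tf

# 1. Nodes that request /gpu:0, or leave the device field empty (and thus default to /gpu:0).
for node in sess.graph_def.node:
    if node.device == '' or '/gpu:0' in node.device.lower():
        print(node.name, '->', node.device or '<unspecified>')

# 2. Ask TensorFlow which devices it actually used for one step.
run_options = tf.RunOptions(output_partition_graphs=True)
run_metadata = tf.RunMetadata()
sess.run([train_op, loss], options=run_options, run_metadata=run_metadata)
for partition in run_metadata.partition_graphs:
    print({node.device for node in partition.node})

# If the backprop ops are the culprit, colocating gradients with their forward ops
# (inside net.training, where the optimizer is created) should keep them on the requested GPU:
#   train_op = optimizer.minimize(loss, colocate_gradients_with_ops=True)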