Hi, I have two GPUs, and sometimes I want to run one script on GPU:0 and another on GPU:1. The question is: how can I execute a Python script on a specific GPU, i.e. bind script execution to a particular GPU? Looking ahead, I'll note that I already know about with tf.device('/device:GPU:1'):.
I thought I could solve my issue with tf.config.experimental.set_visible_devices, but I was wrong. The plan was simple: at the top of the code I would set the required GPU, and the script would then run on the GPU I had made visible. But that didn't work; see the code below. When I run the script, I get this error:
TensorFlow device (GPU:0) is being mapped to multiple CUDA devices (1 now, and 0 previously),
which is not supported. This may be the result of providing different GPU configurations
(ConfigProto.gpu_options, for example different visible_device_list) when creating multiple
Sessions in the same process. This is not currently supported, see https://github.com/tensorflow/tensorflow/issues/19083
import tensorflow as tf

def work():
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    for i in range(100000):
        c = tf.matmul(a, b)

def main():
    print(tf.__version__)
    tf.config.set_soft_device_placement(False)
    tf.debugging.set_log_device_placement(True)
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            print(f'set visible GPU device as {gpus[1]}')
            tf.config.experimental.set_visible_devices(gpus[1], 'GPU')
        except RuntimeError as e:
            print(e)
    device_name = tf.test.gpu_device_name()
    print(device_name)
    with tf.device('/device:GPU:1'):
        work()

if __name__ == "__main__":
    main()
I believe I'm not the first person to run into this issue; could somebody share the solution?
As it turns out, the answer is simple. Here is what I found. When you have 2 GPUs and call tf.config.experimental.set_visible_devices(gpus[1], 'GPU'), only one GPU is available to your script: Device:1 (in my example). BUT, and this is the important part: the name (number) of this device will be Device:0, not Device:1 as you might expect. The reason is that TensorFlow re-enumerates the GPUs visible to your script from scratch, so the numbering starts from 0 again. Even though the real device is named Device:1, inside your script it is addressed as Device:0.
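So the fix for my script above is to keep the set_visible_devices call but address the remaining device as GPU:0. A minimal sketch of the working pattern (same renumbering behaviour as described above):

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if len(gpus) > 1:
    # Make only the second physical GPU visible to this process.
    tf.config.experimental.set_visible_devices(gpus[1], 'GPU')

# The visible device is re-enumerated, so it is now addressed as GPU:0.
with tf.device('/device:GPU:0'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    c = tf.matmul(a, b)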
I hope this can help someone like me.
I want to test FIFOQueue. When I use with tf.device("/device:GPU:0"):, the first run is fine, but when I run it a second time, an error occurs that prints cannot assign gpu to fifo_queue_EnqueueMany (the error details were in an attached screenshot). Anyone warm-hearted enough to help me?
Per drpng's note on Tensorflow: using a FIFO queue for code running on GPUs, I wouldn't expect FIFOQueue to schedule on a GPU. Indeed, wrapping your code in a .py file (so you can see TF's logging output) and logging device placement confirms that even the first (successful) invocation schedules on the CPU.
In one cell run:
%%writefile go.py
import tensorflow as tf

config = tf.ConfigProto()
#config.allow_soft_placement=True
config.gpu_options.allow_growth = True
config.log_device_placement = True

def go():
    Q = tf.FIFOQueue(3, tf.float16)
    enq_many = Q.enqueue_many([[0.1, 0.2, 0.3],])
    with tf.device('/device:GPU:0'):
        with tf.Session(config=config) as sess:
            sess.run(enq_many)
            print(Q.size().eval())

go()
go()
And in another cell execute the above as:
!python3 go.py
and observe placement.
Uncomment the allow_soft_placement assignment to make the crash go away.
(I do not know why the first execution succeeds even in the face of non-soft placement when FIFOQueue is explicitly asked to schedule on the GPU, as in your code's "first time".)
I am trying to run PyTorch code on three nodes using Open MPI, but the code just halts without any errors or output. My eventual goal is to distribute a PyTorch graph across these nodes.
The three nodes are connected to the same LAN, have passwordless SSH access to each other, and have similar specifications:
Ubuntu 18.04
Cuda 10.0
OpenMPI built and installed from source
PyTorch built and installed from source
The code shown below works on a single node with multiple processes, run as:
> mpirun -np 3 -H 192.168.100.101:3 python3 run.py
With following output:
INIT 0 of 3 Init env://
INIT 1 of 3 Init env://
INIT 2 of 3 Init env://
RUN 0 of 3 with tensor([0., 0., 0.])
RUN 1 of 3 with tensor([0., 0., 0.])
RUN 2 of 3 with tensor([0., 0., 0.])
Rank 1 has data tensor(1.)
Rank 0 has data tensor(1.)
Rank 2 has data tensor(1.)
But when I place the code on all three nodes and run the following command on each node separately, it does nothing:
> mpirun -np 3 -H 192.168.100.101:1,192.168.100.102:1,192.168.100.103:1 python3 run.py
Could someone suggest any modifications to the code or MPI configuration needed to run this PyTorch code on multiple nodes?
#!/usr/bin/env python
import os
import torch
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    tensor = torch.zeros(size)
    print(f"RUN {rank} of {size} with {tensor}")
    # incrementing the old tensor
    tensor += 1
    # sending tensor to next rank
    if rank == size - 1:
        dist.send(tensor=tensor, dst=0)
    else:
        dist.send(tensor=tensor, dst=rank + 1)
    # receiving tensor from previous rank
    if rank == 0:
        dist.recv(tensor=tensor, src=size - 1)
    else:
        dist.recv(tensor=tensor, src=rank - 1)
    print('Rank ', rank, ' has data ', tensor[0])

def init_processes(rank, size, fn, backend, init):
    print(f"INIT {rank} of {size} Init {init}")
    dist.init_process_group(backend, init, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    os.environ['MASTER_ADDR'] = '192.168.100.101'
    os.environ['BACKEND'] = 'mpi'
    os.environ['INIT_METHOD'] = 'env://'
    world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
    world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
    init_processes(world_rank, world_size, run,
                   os.environ['BACKEND'], os.environ['INIT_METHOD'])
N.B. NCCL is not an option for me due to arm64-based hardware.
Apologies for replying late to this, but I was able to solve the issue by adding the --mca btl_tcp_if_include eth1 flag to the mpirun command.
The reason for the halt was that Open MPI, by default, tries to locate and communicate with other nodes over the local loopback network interface, e.g. lo. We have to explicitly specify which interface(s) should be included (or excluded) to locate the other nodes.
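For completeness, the multi-node command from the question then becomes (eth1 is the name of my LAN-facing interface; substitute whatever ifconfig shows on your nodes):

mpirun -np 3 -H 192.168.100.101:1,192.168.100.102:1,192.168.100.103:1 \
    --mca btl_tcp_if_include eth1 \
    python3 run.py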
I hope it would save someone's day :)
System information
custom code: no, it is the one in https://www.tensorflow.org/get_started/estimator
system: Apple
OS: Mac OsX 10.13
TensorFlow version: 1.3.0
Python version: 3.6.3
GPU model: AMD FirePro D700 (actually, two such GPUs)
Describe the problem
Dear all,
I am running the simple iris program:
https://www.tensorflow.org/get_started/estimator
under python 3.6.3 and tensorflow 1.3.0.
The program executes correctly, apart from the very last part, i.e. the one related to the confusion matrix.
In fact, the result I get for the confusion matrix is:
New Samples, Class Predictions: [array([b'1'], dtype=object), array([b'2'], dtype=object)]
rather than the expected output:
New Samples, Class Predictions: [1 2]
Has anything about the confusion matrix changed in the latest release?
If so, how should I modify that part of the code?
Thank you very much for your help!
Best regards
Ivan
Source code / logs
https://www.tensorflow.org/get_started/estimator
This looks like a numpy issue. array([b'1'], dtype=object) is one way numpy represents the byte string b'1'; the predicted classes themselves are still 1 and 2.
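A minimal sketch of one way to unwrap those arrays back into plain integers (assuming each prediction is a one-element object array holding a numeric byte string, as in the output above):

import numpy as np

predictions = [np.array([b'1'], dtype=object), np.array([b'2'], dtype=object)]
# int() accepts byte strings in Python 3, so b'1' -> 1.
class_ids = [int(p[0]) for p in predictions]
print("New Samples, Class Predictions: {}".format(class_ids))  # [1, 2]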
I want to use Theano with a GPU, and I use the following script to test whether the GPU is working:
import os
os.environ['THEANO_FLAGS'] = "device=gpu0"

import theano
from theano import function, config, shared, tensor
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
but I get the following result:
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
/usr/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py:556: UserWarning: Theano flag device=gpu* (old gpu back-end) only support floatX=float32. You have floatX=float64. Use the new gpu back-end with device=cuda* for that value of floatX.
warnings.warn(msg)
Using gpu device 0: GeForce GT 720 (CNMeM is enabled with initial size: 50.0% of memory, cuDNN 6021)
/usr/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py:631: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.1.
warnings.warn(warn)
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 3.424644 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the cpu
My Question
What does this result mean, and how do I make the script actually use the GPU?
I had a similar issue with a newer version of Theano. You can try running the script with:
THEANO_FLAGS="floatX=float32,device=gpu,nvcc.flags=-D_FORCE_INLINES" python test_gpu.py
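The key flag is floatX=float32: the warning in your output says the old device=gpu* back end only supports float32, while your run used the float64 default, which is why the exp ended up on the CPU. If you prefer to set the flags inside the script, as in your original code, the equivalent (it must run before the first import theano) would be:

import os
# Equivalent to the command-line THEANO_FLAGS above; set before importing theano.
os.environ['THEANO_FLAGS'] = "floatX=float32,device=gpu,nvcc.flags=-D_FORCE_INLINES"

import theano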
I have access through ssh to a cluster of n GPUs. Tensorflow automatically gave them names gpu:0,...,gpu:(n-1).
Others have access too and sometimes they take random gpus.
I did not place any tf.device() explicitly because that is cumbersome, and even if I selected GPU number j, someone might already be on GPU number j, which would be problematic.
I would like to go through the GPUs' usage, find the first one that is unused, and use only that one.
I guess someone could parse the output of nvidia-smi with bash, obtain a variable i, and feed that variable i to the tensorflow script as the number of the GPU to use.
I have never seen any example of this, and I imagine it is a pretty common problem. What would be the simplest way to do that? Is a pure tensorflow solution available?
I'm not aware of a pure-TensorFlow solution. The problem is that the existing place for TensorFlow configuration is the Session config. However, the GPU memory pool is shared by all TensorFlow sessions within a process, so the Session config would be the wrong place for this, and there is no mechanism for process-global configuration (but there should be, to also be able to configure the process-global Eigen threadpool). So you need to do it at the process level, using the CUDA_VISIBLE_DEVICES environment variable.
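For instance, to pin a script to the second physical GPU straight from the shell (the script name here is just a placeholder):

CUDA_VISIBLE_DEVICES=1 python train.py

Inside that process the selected GPU is then re-enumerated as gpu:0.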
To automate the choice, something like this:
import subprocess, re

# Nvidia-smi GPU memory parsing.
# Tested on nvidia-smi 370.23

def run_command(cmd):
    """Run command, return output as string."""
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True).communicate()[0]
    return output.decode("ascii")

def list_available_gpus():
    """Returns list of available GPU ids."""
    output = run_command("nvidia-smi -L")
    # lines of the form GPU 0: TITAN X
    gpu_regex = re.compile(r"GPU (?P<gpu_id>\d+):")
    result = []
    for line in output.strip().split("\n"):
        m = gpu_regex.match(line)
        assert m, "Couldnt parse " + line
        result.append(int(m.group("gpu_id")))
    return result

def gpu_memory_map():
    """Returns map of GPU id to memory allocated on that GPU."""
    output = run_command("nvidia-smi")
    gpu_output = output[output.find("GPU Memory"):]
    # lines of the form
    # |    0      8734    C   python                            11705MiB |
    memory_regex = re.compile(r"[|]\s+?(?P<gpu_id>\d+)\D+?(?P<pid>\d+).+[ ](?P<gpu_memory>\d+)MiB")
    result = {gpu_id: 0 for gpu_id in list_available_gpus()}
    for row in gpu_output.split("\n"):
        m = memory_regex.search(row)
        if not m:
            continue
        gpu_id = int(m.group("gpu_id"))
        gpu_memory = int(m.group("gpu_memory"))
        result[gpu_id] += gpu_memory
    return result

def pick_gpu_lowest_memory():
    """Returns GPU with the least allocated memory."""
    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
    best_memory, best_gpu = sorted(memory_gpu_map)[0]
    return best_gpu
You can then put this in utils.py and set the GPU in your TensorFlow script before the first tensorflow import, i.e.:
import utils
import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(utils.pick_gpu_lowest_memory())
import tensorflow
An implementation along the lines of Yaroslav Bulatov's solution is available on https://github.com/bamos/setGPU.
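If I recall its README correctly, that package wraps the same trick in a single import that must precede the TensorFlow import (check the repository for the exact current usage):

import setGPU      # picks the least-utilized GPU and sets CUDA_VISIBLE_DEVICES
import tensorflow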