Tensorflow on shared GPUs: how to automatically select the one that is unused - tensorflow

I have access through ssh to a cluster of n GPUs. Tensorflow automatically gave them names gpu:0,...,gpu:(n-1).
Others have access too and sometimes they take random gpus.
I did not place any tf.device() explicitely because that is cumbersome and even if I selected gpu number j and that someone is already on gpu number j that would be problematic.
I would like to go throuh the gpus usage and find the first that is unused and use only this one.
I guess someone could parse the output of nvidia-smi with bash and get a variable i and feed that variable i to the tensorflow script as the number of the gpu to use.
I have never seen any example of this. I imagine it is a pretty common problem. What would be the simplest way to do that ? Is a pure tensorflow one available ?

I'm not aware of pure-TensorFlow solution. The problem is that existing place for TensorFlow configurations is a Session config. However, for GPU memory, a GPU memory pool is shared for all TensorFlow sessions within a process, so Session config would be the wrong place to add it, and there's no mechanism for process-global config (but there should be, to also be able to configure process-global Eigen threadpool). So you need to do on on a process level by using CUDA_VISIBLE_DEVICES environment variable.
Something like this:
import subprocess, re
# Nvidia-smi GPU memory parsing.
# Tested on nvidia-smi 370.23
def run_command(cmd):
"""Run command, return output as string."""
output = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True).communicate()[0]
return output.decode("ascii")
def list_available_gpus():
"""Returns list of available GPU ids."""
output = run_command("nvidia-smi -L")
# lines of the form GPU 0: TITAN X
gpu_regex = re.compile(r"GPU (?P<gpu_id>\d+):")
result = []
for line in output.strip().split("\n"):
m = gpu_regex.match(line)
assert m, "Couldnt parse "+line
result.append(int(m.group("gpu_id")))
return result
def gpu_memory_map():
"""Returns map of GPU id to memory allocated on that GPU."""
output = run_command("nvidia-smi")
gpu_output = output[output.find("GPU Memory"):]
# lines of the form
# | 0 8734 C python 11705MiB |
memory_regex = re.compile(r"[|]\s+?(?P<gpu_id>\d+)\D+?(?P<pid>\d+).+[ ](?P<gpu_memory>\d+)MiB")
rows = gpu_output.split("\n")
result = {gpu_id: 0 for gpu_id in list_available_gpus()}
for row in gpu_output.split("\n"):
m = memory_regex.search(row)
if not m:
continue
gpu_id = int(m.group("gpu_id"))
gpu_memory = int(m.group("gpu_memory"))
result[gpu_id] += gpu_memory
return result
def pick_gpu_lowest_memory():
"""Returns GPU with the least allocated memory"""
memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
best_memory, best_gpu = sorted(memory_gpu_map)[0]
return best_gpu
You can then put it in utils.py and set GPU in your TensorFlow script before first tensorflow import. IE
import utils
import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(utils.pick_gpu_lowest_memory())
import tensorflow

An implementation along the lines of Yaroslav Bulatov's solution is available on https://github.com/bamos/setGPU.

Related

fastai ulmfit model trained on cuda machin for cpu

I have an export.pkl model which has been trained on a cuda machine. I want to use it on a macbook:
from fastai.text import load_learner
from utils import get_corpus
learner = load_learner('./models')
corpus = get_corpus()
res = [ str(learner.predict(c)[0]) for c in corpus ]
I get the following error:
...
File "/Users/gautiergilabert/Envs/cc/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 146, in forward
"them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu
I have two questions:
I found the raise in my export.pkl:
for t in chain(self.module.parameters(), self.module.buffers()):
if t.device != self.src_device_obj:
raise RuntimeError("module must have its parameters and buffers "
"on device {} (device_ids[0]) but found one of "
"them on device: {}".format(self.src_device_obj, t.device))
It is said about the module in the docstring: module to be parallelized. I don't really understand what it is. My macbook ?
Apart my macbook, I would like to run the model on a cpu
Is there a way to make this export.pkl model works on a cpu ?
Is there a way to make another export.pkl on cuda and make it available on a cpu ?
Thanks
One way is to load the learner by specifying the model with empty dataset and loading the model weights afterwards. For resnet image classifier something like this should work:
from fastai.vision import *
# path where the model is saved under path/models/model-name
path = "model_path"
tfms = get_transforms()
data = ImageDataBunch.single_from_classes(".", classes=["class1", "class2"], ds_tfms=tfms)
learner = cnn_learner(data, models.resnet34, metrics=accuracy)
# loads model from model_path/models/model_name.pth
learner.load("model_name")
image = open_image("test.jpg")
pred_class, pred_idx, outputs = learner.predict(image)

Memory leak when running universal-sentence-encoder-large itterating on dataframe

I have 140K sentences I want to get embeddings for. I am using TF_HUB Universal Sentence Encoder and am iterating over the sentences(I know it's not the best way but when I try to feed over 500 sentences into the model it crashes).
My Environment is:
Ubuntu 18.04
Python 3.7.4
TF 1.14
Ram: 16gb
processor: i-5
my code is:
version 1
I iterate inside the tf.session context manager
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
df = pandas_repository.get_dataframe_from_table('sentences')
with tf.compat.v1.Session() as session:
session.run(tf.global_variables_initializer())
session.run(tf.tables_initializer())
sentence_embedding = None
for i, row in df.iterrows():
sentence = row['content']
embeddings = embed([sentence])
sentence_embedding = session.run(embeddings)
df.at[i, 'embedding'] = sentence_embedding
print('processed index:', i)
version 2
I open and close a session within each iteration
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
df = pandas_repository.get_dataframe_from_table('sentences')
for i, row in df.iterrows():
sentence = row['content']
embeddings = embed([sentence])
sentence_embedding = None
with tf.compat.v1.Session() as session:
session.run(tf.global_variables_initializer())
session.run(tf.tables_initializer())
sentence_embedding = session.run(embeddings)
df.at[i, 'embedding'] = sentence_embedding
print('processed index:', i)
While version 2 does seem to have some sort of GC and memory is cleared a bit. It still goes over 50 items and explodes.
version 1 just goes on gobbling memory.
The correct solution as given by arnoegw
def calculate_embeddings(dataframe, table_name):
sql_get_sentences = "SELECT * FROM semantic_similarity.sentences WHERE embedding IS NULL LIMIT 1500"
sql_update = 'UPDATE {} SET embedding = data.embedding FROM (VALUES %s) AS data(id, embedding) WHERE {}.id = data.id'.format(table_name, table_name)
df = pandas_repository.get_dataframe_from_sql(sql_get_sentences)
with hub.eval_function_for_module("https://tfhub.dev/google/universal-sentence-encoder-large/3") as embed:
while len(df) >= 0:
sentence_array = df['content'].values
sentence_embeddings = embed(sentence_array)
df['embedding'] = sentence_embeddings.tolist()
values = [tuple(x) for x in df[['id', 'embedding']].values]
pandas_repository.update_db_from_df('semantic_similarity.sentences', sql_update, values)
df = pandas_repository.get_dataframe_from_sql(sql_get_sentences)
I am a newbee to TF and can use any help I can get.
Your code uses tf.Session, so it falls under the TF1.x programming model of first building a dataflow graph and then running it repeatedly with inputs being fed and outputs being fetched from the graph.
But your code does not align well with that programming model. Both versions keep adding new applications of (calls to) the hub.Module to the default TensorFlow graph instead of applying it once and running the same graph repeatedly for the various inputs. Version 2 keeps going into and out of tf.Sessions, which frees some memory but is very inefficient.
Please see my answer to "Strongly increasing memory consumption when using ELMo from Tensorflow-Hub" for guidance how to do it right in the graph-based programming model of TensorFlow 1.x.
TensorFlow 2.0, which is going to be released soon, defaults to the programming model of "eager execution", which does away with graphs and sessions and would have avoided this confusion. TensorFlow Hub will be updated in due course for TF2.0. For a preview close to your use-case, see https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/tf2_text_classification.ipynb

Find a GPU with enough memory

I want to programmatically find out the available GPUs and their current memory usage and use one of the GPUs based on their memory availability. I want to do this in PyTorch.
I have seen the following solution in this post:
import torch.cuda as cutorch
for i in range(cutorch.device_count()):
if cutorch.getMemoryUsage(i) > MEM:
opts.gpuID = i
break
but it is not working in PyTorch 0.3.1 (there is no function called, getMemoryUsage). I am interested in a PyTorch based (using the library functions) solution. Any help would be appreciated.
In the webpage you give, there exist an answer:
#!/usr/bin/env python
# encoding: utf-8
import subprocess
def get_gpu_memory_map():
"""Get the current gpu usage.
Returns
-------
usage: dict
Keys are device ids as integers.
Values are memory usage as integers in MB.
"""
result = subprocess.check_output(
[
'nvidia-smi', '--query-gpu=memory.used',
'--format=csv,nounits,noheader'
])
# Convert lines into a dictionary
gpu_memory = [int(x) for x in result.strip().split('\n')]
gpu_memory_map = dict(zip(range(len(gpu_memory)), gpu_memory))
return gpu_memory_map
print get_gpu_memory_map()

I want to used theano with GPU ,but it seemed that GPU is not worked?

I want to use theano with GPU ,and I use the following script to test if GPU is working:
import os
os.environ['THEANO_FLAGS'] = "device=gpu0"
import theano
from theano import function, config, shared, tensor
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
('Gpu' not in type(x.op).__name__)
for x in f.maker.fgraph.toposort()]):
print('Used the cpu')
else:
print('Used the gpu')
but I get the following result:
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
/usr/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py:556: UserWarning: Theano flag device=gpu* (old gpu back-end) only support floatX=float32. You have floatX=float64. Use the new gpu back-end with device=cuda* for that value of floatX.
warnings.warn(msg)
Using gpu device 0: GeForce GT 720 (CNMeM is enabled with initial size: 50.0% of memory, cuDNN 6021)
/usr/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py:631: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.1.
warnings.warn(warn)
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 3.424644 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the cpu
My Question
What does the result mean? and How do I make it work to use the GPU?
I had similar issue with new version of theano. You can try with
THEANO_FLAGS="floatX=float32,device=gpu,nvcc.flags=-D_FORCE_INLINES" python test_gpu.py

Distributed tensorflow with multiple gpu

It seems that tf.train.replica_device_setter doesn't allow specify gpu which work with.
What I want to do is like below:
with tf.device(
tf.train.replica_device_setter(
worker_device='/job:worker:task:%d/gpu:%d' % (deviceindex, gpuindex)):
<build-some-tf-graph>
If your parameters are not sharded, you could do it with a simplified version of replica_device_setter like below:
def assign_to_device(worker=0, gpu=0, ps_device="/job:ps/task:0/cpu:0"):
def _assign(op):
node_def = op if isinstance(op, tf.NodeDef) else op.node_def
if node_def.op == "Variable":
return ps_device
else:
return "/job:worker/task:%d/gpu:%d" % (worker, gpu)
return _assign
with tf.device(assign_to_device(1, 2)):
# this op goes on worker 1 gpu 2
my_op = tf.ones(())
I didn't check previous versions, but in Tensorflow 1.4/1.5, you can specify devices in replica_device_setter(worker_device='job:worker/task:%d/gpu:%d' %
(FLAGS.task_index, i), cluster=self.cluster).
See tensorflow/python/training/device_setter.py line 199-202:
if ps_ops is None:
# TODO(sherrym): Variables in the LOCAL_VARIABLES collection should not be
# placed in the parameter server.
ps_ops = ["Variable", "VariableV2", "VarHandleOp"]
Thanks to the code provided by #Yaroslav Bulatov, but his protocol is different from replica_device_setter, and may fail in some cases.