Using MPI in TensorFlow's distributed deep learning

I am using a cluster with 8x RTX 2080 Ti GPUs (11 GB each) for distributed deep learning. My goal is to utilize all the GPUs for training the model. My code uses MPI to gather all the processes across the cluster and tries to distribute the work across all the workers, but this gives me an error. I am using Python 3.9 and TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1.
I am currently lost and don't know what I need to do in this case. I tried installing other TensorFlow versions using conda, but it ends up the same way.
SLURM file:
#!/bin/bash
#SBATCH --job-name=job1 # Job name
#SBATCH --mem=30000 # Job memory request
#SBATCH --gres=gpu:4 # Number of requested GPU(s)
#SBATCH --time=3-23:00:00 # Time limit days-hrs:min:sec
#SBATCH --constraint=rtx_2080 # Specific hardware constraint
#SBATCH --error=slurm.err # Error file name
#SBATCH --output=slurm.out # Output file name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --array=1-2%1
if [ -d "model-final" ]
then
    scancel $SLURM_ARRAY_JOB_ID
else
    module load Anaconda3/2020.07
    module load TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1
    mpirun python -u main.py resume_latest
fi
My error:
Instructions for updating:
use distribute.MultiWorkerMirroredStrategy instead
2023-01-18 13:18:35.789808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9687 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:3d:00.0, compute capability: 7.5
2023-01-18 13:18:35.790848: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9687 MB memory: -> device: 1, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:3e:00.0, compute capability: 7.5
2023-01-18 13:18:35.791743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 9687 MB memory: -> device: 2, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:88:00.0, compute capability: 7.5
2023-01-18 13:18:35.792678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 9687 MB memory: -> device: 3, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:89:00.0, compute capability: 7.5
2023-01-18 13:18:35.804893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:worker/replica:0/task:0/device:GPU:0 with 9687 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:3d:00.0, compute capability: 7.5
2023-01-18 13:18:35.805620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:worker/replica:0/task:0/device:GPU:1 with 9687 MB memory: -> device: 1, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:3e:00.0, compute capability: 7.5
2023-01-18 13:18:35.806333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:worker/replica:0/task:0/device:GPU:2 with 9687 MB memory: -> device: 2, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:88:00.0, compute capability: 7.5
2023-01-18 13:18:35.807029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:worker/replica:0/task:0/device:GPU:3 with 9687 MB memory: -> device: 3, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:89:00.0, compute capability: 7.5
2023-01-18 13:18:35.810512: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> g01:37672}
2023-01-18 13:18:35.810736: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://g01:37672
/usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/keras/optimizer_v2/optimizer_v2.py:355: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
warnings.warn(
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
2023-01-18 13:18:42.547198: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:695] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
key: "Toutput_types"
value {
list {
type: DT_DOUBLE
type: DT_DOUBLE
}
}
}
attr {
key: "output_shapes"
value {
list {
shape {
dim {
size: 15
}
}
shape {
dim {
size: 13
}
}
}
}
}
2023-01-18 13:18:42.740015: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
[g01:44037:0:44313] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 44313) ====
0 0x000000000002137e ucs_debug_print_backtrace() /umbc/ebuild-soft/cascade-lake/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucs/debug/debug.c:656
1 0x000000000382045b tensorflow::NcclCommunicator::Enqueue() collective_communicator.cc:0
2 0x0000000005c9f88a tensorflow::NcclReducer::Run() ???:0
3 0x00000000009086dc tensorflow::BaseCollectiveExecutor::ExecuteAsync(tensorflow::OpKernelContext*, tensorflow::CollectiveParams const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (tensorflow::Status const&)>)::{lambda()#3}::operator()() base_collective_executor.cc:0
4 0x0000000000b99403 tensorflow::UnboundedWorkQueue::PooledThreadFunc() ???:0
5 0x0000000000b9f6b1 tensorflow::(anonymous namespace)::PThread::ThreadFn() env.cc:0
6 0x0000000000007ea5 start_thread() pthread_create.c:0
7 0x00000000000feb0d __clone() ???:0
=================================
[g01:44037] *** Process received signal ***
[g01:44037] Signal: Segmentation fault (11)
[g01:44037] Signal code: (-6)
[g01:44037] Failing at address: 0x2ecf70000ac05
[g01:44037] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2aaaab7e6630]
[g01:44037] [ 1] /usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x382045b)[0x2aaab68fc45b]
[g01:44037] [ 2] /usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow11NcclReducer3RunESt8functionIFvRKNS_6StatusEEE+0x1ca)[0x2aaab8d7b88a]
[g01:44037] [ 3] /usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x9086dc)[0x2aaadc7556dc]
[g01:44037] [ 4] /usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow18UnboundedWorkQueue16PooledThreadFuncEv+0x1b3)[0x2aaadc9e6403]
[g01:44037] [ 5] /usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0xb9f6b1)[0x2aaadc9ec6b1]
[g01:44037] [ 6] /lib64/libpthread.so.0(+0x7ea5)[0x2aaaab7deea5]
[g01:44037] [ 7] /lib64/libc.so.6(clone+0x6d)[0x2aaaac468b0d]
[g01:44037] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 44037 on node g01 exited on signal 11 (Segmentation fault).
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
# Load in the parameter files
from json import load as loadf
with open("params.json", 'r') as inFile:
    params = loadf(inFile)
# Get data files and prep them for the generator
from tensorflow import distribute as D
callbacks = []
devices = getDevices()
print(devices)
set_tf_config_mpi()
strat = D.experimental.MultiWorkerMirroredStrategy(
    communication=D.experimental.CollectiveCommunication.NCCL)
# Create network
from sys import argv
resume_training = False
print(argv)
if "resume_latest" in argv:
    resume_training = True
with strat.scope():
    # Scheduler
    if isinstance(params["learning_rate"], str):
        # Get the string for the importable function
        lr = params["learning_rate"]
        from tensorflow.keras.callbacks import LearningRateScheduler
        # Use a dummy learning rate
        params["learning_rate"] = 0.1
        # model = create_model(**params)
        # Get the importable function
        lr = lr.split(".")
        baseImport = __import__(lr[0], globals(), locals(), [lr[1]], 0)
        lr = getattr(baseImport, lr[1])
        # Make a schedule
        lr = LearningRateScheduler(lr)
        callbacks.append(lr)
    # Resume Model?
    model_name = None
    if resume_training:
        initial_epoch, model_name = getInitialEpochsAndModelName(rank)
    if model_name is None:
        initial_epoch = 0
        model = create_model(**params)
        resume_training = False
    else:
        from tensorflow.keras.models import load_model
        model = load_model(model_name)
    # Load data from disk
    import numpy
    if "root" in params.keys():
        root = params['root']
    else:
        root = "./"
    if "filename" in params.keys():
        filename = params["filename"]
    else:
        filename = "dataset_timeseries.csv"
    restricted = [
        'euc1', 'e1', 'x1', 'y1', 'z1',
        'euc2', 'e2', 'x2', 'y2', 'z2',
        'euc3', 'e3', 'x3', 'y3', 'z3',
    ]
    x, y = getOneHot("{}/{}".format(root, filename), restricted=restricted, **params)
    # val_x, val_y = getOneHot("{}/{}".format(root, val_filename), restricted=restricted)
    val_x, val_y = None, None
    params["gbatch_size"] = params['batch_size'] * len(devices)
    print("x.shape =", x.shape)
    print("y.shape =", y.shape)
    print("epochs =", params['epochs'], type(params['epochs']))
    print("batch =", params['batch_size'], type(params['batch_size']))
    print("gbatch =", params['gbatch_size'], type(params['gbatch_size']))
    # Load data into a distributed dataset
    # Dataset object does nothing in place:
    # https://stackoverflow.com/questions/55645953/shape-of-tensorflow-dataset-data-in-keras-tensorflow-2-0-is-wrong-after-conver
    from tensorflow.data import Dataset
    data = Dataset.from_tensor_slices((x, y))
    # Create validation set
    v = params['validation']
    if val_x is not None:
        vrecord = val_x.shape[0]
        val = Dataset.from_tensor_slices((val_x, val_y))
        validation = val  # data.take(vrecord)
    else:
        vrecord = int(x.shape[0]*v)
        validation = data.take(vrecord)
    validation = validation.batch(params['gbatch_size'])
    validation = validation.repeat(params['epochs'])
    # Validation -- need to do kfold one day
    # This set should NOT be distributed
    vsteps = vrecord // params['gbatch_size']
    if vrecord % params['gbatch_size'] != 0:
        vsteps += 1
    # Shuffle the data during preprocessing or suffer...
    # Parallel randomness == nightmare
    # data = data.shuffle(x.shape[0])
    # Ordering these two things is very important!
    # Consider 3 elements, batch size 2 repeat 2
    # [1 2 3] -> [[1 2] [3]] -> [[1 2] [3] [1 2] [3]] (correct) batch -> repeat
    # [1 2 3] -> [1 2 3 1 2 3] -> [[1 2] [3 1] [2 3]] (incorrect) repeat -> batch
    # data = data.skip(vrecord)
    data = data.batch(params['gbatch_size'])
    data = data.repeat(params['epochs'])
    records = x.shape[0]  # - vrecord
    steps = records // params['gbatch_size']
    if records % params['gbatch_size']:
        steps += 1
    print("steps =", steps)
    # Note that if we are resuming that the number of _remaining_ epochs has
    # changed!
    # The number of epochs * steps is the numbers of samples to drop
    print("initial cardinality = ", data.cardinality())
    print("initial v cardinality = ", data.cardinality())
    data = data.skip(initial_epoch*steps)
    validation = validation.skip(initial_epoch*vsteps)
    print("final cardinality = ", data.cardinality())
    print("final v cardinality = ", data.cardinality())
    # data = strat.experimental_distribute_dataset(data)
    # Split into validation and training
    callbacks = createCallbacks(params, callbacks, rank, resume_training)
    print(callbacks)
    history = model.fit(data, epochs=params['epochs'],
                        batch_size=params['gbatch_size'],
                        steps_per_epoch=steps,
                        verbose=0,
                        initial_epoch=initial_epoch,
                        validation_data=validation,
                        validation_steps=vsteps,
                        callbacks=callbacks)
    if rank == 0:
        model.save("model-final")
    else:
        model.save("checkpoints/model-tmp")

Related

How to automatically assign free GPUs in TensorFlow

I have 4 Tesla K80 GPUs in my system. I would like to automatically allocate free GPUs based on an integer input in the code. I am aware of tf.config.experimental.set_visible_devices() to assign specific GPUs, but I currently do not know how to identify which of the GPUs are in use (except manually, using nvidia-smi). I am currently changing the code below for every run.
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only use the first GPU
    try:
        tf.config.experimental.set_visible_devices(gpus[2:], 'GPU')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)
The above code lets me set the GPUs I want to allocate (GPUs 2 and 3 in the above example) for the run. Is there any way to obtain a list of free (unused) devices to automate the allocation process instead of manually having to identify which of the devices should be set?
I am currently using TensorFlow version 1.15.
import subprocess, re
import os
import utils

# Nvidia-smi GPU memory parsing.
# Tested on nvidia-smi 370.23
# TF1.15

def run_command(cmd):
    """Run command, return output as string."""
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True).communicate()[0]
    return output.decode("ascii")

def list_available_gpus():
    """Returns list of available GPU ids."""
    output = run_command("nvidia-smi -L")
    # lines of the form GPU 0: TITAN X
    gpu_regex = re.compile(r"GPU (?P<gpu_id>\d+):")
    result = []
    for line in output.strip().split("\n"):
        m = gpu_regex.match(line)
        assert m, "Couldnt parse " + line
        result.append(int(m.group("gpu_id")))
    return result

def gpu_memory_map():
    """Returns map of GPU id to memory allocated on that GPU."""
    output = run_command("nvidia-smi")
    gpu_output = output[output.find("GPU Memory"):]
    # lines of the form
    # | 0 8734 C python 11705MiB |
    memory_regex = re.compile(r"[|]\s+?(?P<gpu_id>\d+)\D+?(?P<pid>\d+).+[ ](?P<gpu_memory>\d+)MiB")
    rows = gpu_output.split("\n")
    result = {gpu_id: 0 for gpu_id in list_available_gpus()}
    for row in gpu_output.split("\n"):
        m = memory_regex.search(row)
        if not m:
            continue
        gpu_id = int(m.group("gpu_id"))
        gpu_memory = int(m.group("gpu_memory"))
        result[gpu_id] += gpu_memory
    return result

def pick_gpu_lowest_memory():
    """Returns GPU with the least allocated memory"""
    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
    best_memory, best_gpu = sorted(memory_gpu_map)[0]
    return best_gpu

def pick_free_gpus(num_gpus=1):
    """Returns free GPUs with the least allocated memory"""
    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
    sorted_list = sorted(memory_gpu_map)
    gpu_list = []
    for i in range(num_gpus):
        if sorted_list[i][0] == 0:
            gpu_list.append(sorted_list[i][1])
        else:
            print(f'Currently fewer than {num_gpus} GPUs are free right now, choose {i} or fewer GPUs')
            exit()
    return ','.join(map(str, gpu_list))

num_gpus = 2
os.environ["CUDA_VISIBLE_DEVICES"] = pick_free_gpus(num_gpus)
import tensorflow as tf
tf.config.optimizer.set_jit(True)  # Enable XLA.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only use the first GPU
    try:
        tf.config.experimental.set_visible_devices(gpus, 'GPU')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)
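If parsing the full nvidia-smi table feels brittle, a shorter alternative sketch (assuming a driver recent enough to support nvidia-smi's --query-gpu interface, which modern versions do) queries per-GPU used memory directly; the threshold and helper name here are only illustrative:

# Sketch of a simpler free-GPU picker using nvidia-smi's query interface.
# Note: nvidia-smi ordering may differ from CUDA ordering unless
# CUDA_DEVICE_ORDER=PCI_BUS_ID is set.
import subprocess

def free_gpu_ids(threshold_mib=64):
    """Return ids of GPUs whose used memory is below threshold_mib."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"]).decode()
    used = [int(x) for x in out.strip().splitlines()]
    return [i for i, mib in enumerate(used) if mib < threshold_mib]

# Example: expose the first two free GPUs to TensorFlow.
# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, free_gpu_ids()[:2]))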

TF estimator gradient boosted classifier suddenly stopped while training

I have trained a gradient boosted classifier with the TF example code
https://www.tensorflow.org/tutorials/estimators/boosted_trees_model_understanding
but the TF estimator gradient boosted classifier suddenly stopped while training.
It takes several steps at the beginning, then suddenly stops without printing any exception.
How can I find out why Python crashed? It's hard to get the reason why it stopped.
Environment:
lib: TF-gpu 1.13.1
cuda: 10.0
cudnn: 7.5
logs:
2019-04-15 16:40:26.175889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7845 pciBusID: 0000:07:00.0 totalMemory: 6.00GiB freeMemory: 4.97GiB
2019-04-15 16:40:26.182620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-15 16:40:26.832040: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-15 16:40:26.835620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-04-15 16:40:26.836840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-04-15 16:40:26.838276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4716 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:07:00.0, compute capability: 6.1)
WARNING:tensorflow:From D:\python\lib\site-packages\tensorflow\python\training\saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file APIs to check for files with this prefix.
WARNING:tensorflow:From D:\python\lib\site-packages\tensorflow\python\training\saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file utilities to get mtimes.
WARNING:tensorflow:Issue encountered when serializing resources. Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore. '_Resource' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing resources. Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore. '_Resource' object has no attribute 'name'
D:\py> (just finished on training)
trn = pd.read_csv('data/santander-customer-transaction-prediction/train.csv')
tst = pd.read_csv('data/santander-customer-transaction-prediction/test.csv')
#trn = upsample(trn[trn.target==0], trn[trn.target==1])
# trn = downsample(trn[trn.target==0], trn[trn.target==1])
features = trn.columns.values[2:202]
target_name = trn.columns.values[1]
train=trn[features]
target=trn[target_name]
NUM_EXAMPLES = len (target)
print (NUM_EXAMPLES)
feat1 = train.corrwith(target).sort_values().head(20).index
feat2 = train.corrwith(target).sort_values().tail(20).index
featonly = feat1.append(feat2)
feat = featonly.append(pd.Index(['target']))
train_origin, tt = train_test_split(trn, test_size=0.2)
train = train_origin[featonly]
target = train_origin[target_name]
test = tst[featonly]
target_name_tst = tst.columns.values[1]
target_tst=tst[target_name_tst]
val_origin=tt
val_train = tt[featonly]
val_target = tt[target_name]
# Training and evaluation input functions.
train_input_fn = make_input_fn(train, target)
val_input_fn = make_input_fn(val_train, val_target)
ttt=tf.estimator.inputs.pandas_input_fn(x=test,num_epochs=1,shuffle=False)
del train,target,val_train,train_origin,trn,tst
fc = tf.feature_column
feature_columns = []
for feature_name in featonly:
feature_columns.append(fc.numeric_column(feature_name,dtype=tf.float32))
#feature_columns
#5
#tf.logging.set_verbosity(tf.logging.INFO)
#logging_hook = tf.train.LoggingTensorHook({"loss" : loss, "accuracy" : accuracy}, every_n_iter=10)
params = {
    'n_trees': 50,
    'max_depth': 3,
    'n_batches_per_layer': 1,
    # You must enable center_bias = True to get DFCs. This will force the model to
    # make an initial prediction before using any features (e.g. use the mean of
    # the training labels for regression or log odds for classification when
    # using cross entropy loss).
    'center_bias': True
}
# config = tf.estimator.RunConfig().replace(keep_checkpoint_max = 1,
# log_step_count_steps=20, save_checkpoints_steps=20)
est = tf.estimator.BoostedTreesClassifier(feature_columns, **params,model_dir='d:\py/model/')
est.train(train_input_fn, max_steps=50)
-------------------------------------------stopped
metrics = est.evaluate(input_fn=val_input_fn,steps=1)
results = est.predict(input_fn=ttt )
result_list = list(results)
classi = list(map(lambda x : x['classes'][0].decode("utf-8"), result_list))
num = list(range(0,len(classi)))
numi = list(map(lambda x : 'test_' + str(x),num))
#df1 = pd.DataFrame(columns=('ID_code','target'))
df_result = pd.DataFrame({'ID_code' : numi, 'target' : classi})
df_result.to_csv('result/submission03.csv',index=False)
def make_input_fn(X, y, n_epochs=None, shuffle=True):
    def input_fn():
        NUM_EXAMPLES = len(y)
        dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
        # dataset = tf.data.Dataset.from_tensor_slices((X.to_dict(orient='list'), y))
        # if shuffle:
        #     dataset = dataset.shuffle(NUM_EXAMPLES)
        # For training, cycle thru dataset as many times as need (n_epochs=None).
        dataset = (dataset.repeat(n_epochs).batch(NUM_EXAMPLES))
        return dataset
    return input_fn
The evaluation result should be shown.
I think the problem is caused by GPU memory overflow.
You can try to set 'n_batches_per_layer' to a bigger value according to your GPU memory size.
I worked with a 6 GB GPU, and a value of 16 worked for me.
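For illustration, a hedged sketch of that change (BATCH_SIZE and its value are assumptions, and train, target and feature_columns are reused from the question's code) feeds smaller batches and raises n_batches_per_layer so each layer still sees roughly the whole training set:

import math
import tensorflow as tf

# Assumed: train (features), target (labels) and feature_columns as in the question.
BATCH_SIZE = 4096                      # smaller than one full-dataset batch
NUM_EXAMPLES = len(target)

params = {
    'n_trees': 50,
    'max_depth': 3,
    # Each layer is now built from several smaller batches instead of one
    # full-dataset batch; ~16 worked on a 6 GB GPU as mentioned above.
    'n_batches_per_layer': math.ceil(NUM_EXAMPLES / BATCH_SIZE),
    'center_bias': True,
}

def make_input_fn(X, y, n_epochs=None):
    def input_fn():
        dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
        return dataset.repeat(n_epochs).batch(BATCH_SIZE)
    return input_fn

est = tf.estimator.BoostedTreesClassifier(feature_columns, **params)
est.train(make_input_fn(train, target), max_steps=50)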

Transforming mLSTM - Run it on multiple GPUs

I'm running an mLSTM (multiplicative LSTM) transform (based on the mLSTM by OpenAI; just the transform, it is already trained), but it takes a really long time to transform more than ~100,000 docs.
I want it to run on multiple GPUs. I saw some examples, but I have no idea how to implement this for the mLSTM transform code.
The specific part that I want to run on multiple GPUs is:
def transform(xs):
    tstart = time.time()
    xs = [preprocess(x) for x in xs]
    lens = np.asarray([len(x) for x in xs])
    sorted_idxs = np.argsort(lens)
    unsort_idxs = np.argsort(sorted_idxs)
    sorted_xs = [xs[i] for i in sorted_idxs]
    maxlen = np.max(lens)
    offset = 0
    n = len(xs)
    smb = np.zeros((2, n, hps.nhidden), dtype=np.float32)
    for step in range(0, ceil_round_step(maxlen, nsteps), nsteps):
        start = step
        end = step + nsteps
        xsubseq = [x[start:end] for x in sorted_xs]
        ndone = sum([x == b'' for x in xsubseq])
        offset += ndone
        xsubseq = xsubseq[ndone:]
        sorted_xs = sorted_xs[ndone:]
        nsubseq = len(xsubseq)
        xmb, mmb = batch_pad(xsubseq, nsubseq, nsteps)
        for batch in range(0, nsubseq, nbatch):
            start = batch
            end = batch + nbatch
            batch_smb = seq_rep(
                xmb[start:end], mmb[start:end],
                smb[:, offset+start:offset+end, :])
            smb[:, offset+start:offset+end, :] = batch_smb
    features = smb[0, unsort_idxs, :]
    print('%0.3f seconds to transform %d examples' %
          (time.time() - tstart, n))
    return features
This is just a snippet of the full code (I don't think it's OK to copy the entire code here).
The part you're referring to is not the place that splits the computation across GPUs; it only transforms the data (on a CPU!) and runs the session.
The correct place is the one that defines the computational graph, e.g. the mlstm method. There are many ways to split a graph, e.g. placing LSTM cells on different GPUs so that the input sequence can be processed in parallel:
def mlstm(inputs, c, h, M, ndim, scope='lstm', wn=False):
    [...]
    for idx, x in enumerate(inputs):
        with tf.device('/gpu:' + str(idx % GPU_COUNT)):
            m = tf.matmul(x, wmx) * tf.matmul(h, wmh)
            z = tf.matmul(x, wx) + tf.matmul(m, wh) + b
    [...]
By the way, there is a useful config option in TensorFlow, log_device_placement, that helps you see the execution details in the output. Here's an example:
import tensorflow as tf
# Creates a graph.
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name='b')
    c = tf.add(a, b)
# Creates a session with log_device_placement set to True.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    # Prints the following:
    # Device mapping:
    # /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: <GPU name>, pci bus id: 0000:01:00.0, compute capability: 6.1
    # Add: (Add): /job:localhost/replica:0/task:0/device:GPU:0
    # b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
    # a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
    print(sess.run(c))

MatMul in TensorFlow is slower than dot product in numpy

I am observing that on my machine tf.matmul in TensorFlow runs significantly slower than the dot product in NumPy. I have a GTX 1080 GPU, and I expect tf.matmul to be at least as fast as running the code on the CPU (NumPy).
Environment Info
Operating System
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.10
Release: 16.10
Codename: yakkety
Installed version of CUDA and cuDNN:
ls -l /usr/local/cuda-8.0/lib64/libcud*
-rw-r--r-- 1 root root 556000 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root 16 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart.so -> libcudart.so.8.0
lrwxrwxrwx 1 root root 19 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart.so.8.0 -> libcudart.so.8.0.61
-rwxr-xr-x 1 root root 415432 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart.so.8.0.61
-rw-r--r-- 1 root root 775162 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart_static.a
lrwxrwxrwx 1 voldemaro users 13 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn.so -> libcudnn.so.5
lrwxrwxrwx 1 voldemaro users 18 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn.so.5 -> libcudnn.so.5.1.10
-rwxr-xr-x 1 voldemaro users 84163560 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn.so.5.1.10
-rw-r--r-- 1 voldemaro users 70364814 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn_static.a
TensorFlow Setup
python -c "import tensorflow; print(tensorflow.__version__)"
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
1.0.0
Code:
'''
Created on Sep 28, 2017
#author: voldemaro
Running on I7/GTX 1080
no MKL
('TF version: ', 'v1.0.0-rc2-15-g47bba63-dirty')
('TF url: ', 'https://github.com/tensorflow/tensorflow/commit/47bba63')
Timing in ms for 2048 x 2048 SVD of type <type 'numpy.float32'> and matmul for 16920 x 2048 of type <type 'numpy.float32'>
numpy default SVD min: 3956.20, median: 4127.75, mean: 4264.41
TF CPU SVD min: 5926.43, median: 5951.70, mean: 5961.43
TF GPU SVD min: 5917.10, median: 6015.87, mean: 6039.63
numpy default .dot product min: 5816.97, median: 5933.43, mean: 5965.22
TF CPU matmul min: 21939.19, median: 22485.99, mean: 22374.69
TF GPU matmul min: 22026.52, median: 22109.97, mean: 22199.43
'''
from scipy import linalg; # for svd
import numpy as np;
import os;
import sys;
import time;
os.environ["TF_CPP_MIN_LOG_LEVEL"]="2" # nospam
import tensorflow as tf;
import gc; gc.disable();
NUM_RUNS = 5;
dtype = np.float32;
N=2048;
M = 16920;
def get_tensorflow_version_url():
    import tensorflow as tf
    version = tf.__version__
    commit = tf.__git_version__
    # commit looks like this
    # 'v1.0.0-65-g4763edf-dirty'
    commit = commit.replace("'", "")
    if commit.endswith('-dirty'):
        dirty = True
        commit = commit[:-len('-dirty')]
    commit = commit.rsplit('-g', 1)[1]
    url = 'https://github.com/tensorflow/tensorflow/commit/' + commit
    return url

def get_mkl_version():
    import ctypes
    import numpy as np
    ver = np.zeros(199, dtype=np.uint8)
    mkl = ctypes.cdll.LoadLibrary("libmkl_rt.so")
    mkl.MKL_Get_Version_String(ver.ctypes.data_as(ctypes.c_char_p), 198)
    return ver[ver != 0].tostring()

timeline_counter = 0
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE);

def benchmark(message, func):
    time_list = []
    for i in range(NUM_RUNS):
        start_time = time.time();
        func();
        time_list.append(time.time()-start_time);
    time_list = 1000*np.array(time_list);  # get seconds, convert to ms
    if len(time_list)>0:
        min = np.min(time_list);
        median = np.median(time_list);
        formatted = ["%.2f"%(d,) for d in time_list[:10]];
        result = "min: %8.2f, median: %8.2f, mean: %8.2f"%(min, median, np.mean(time_list))
    else:
        result = "empty"
    print("%-20s %s"%(message, result))

if np.__config__.get_info("lapack_mkl_info"):
    print("MKL version", get_mkl_version())
else:
    print("no MKL")
print("TF version: ", tf.__git_version__)
print("TF url: ", get_tensorflow_version_url())
svd_array = np.random.random_sample((N,N)).astype(dtype);
another_array = np.random.random_sample((M,N)).astype(dtype);
init_OP = tf.global_variables_initializer();
with tf.device("/gpu:0"):
    init_holder_gpu = tf.placeholder(dtype, shape=(M,M));
    specVarGPU = tf.random_uniform((N,N), dtype=dtype);
    S_gpu = tf.random_uniform((M,N), dtype=dtype);
    V_gpu = tf.matmul(tf.matmul(tf.transpose(tf.transpose(tf.conj(S_gpu))), specVarGPU, ), tf.transpose(S_gpu));
    [D2_gpu, E1_gpu, E2_gpu] = tf.svd(specVarGPU);
with tf.device("/cpu:0"):
    init_holder_cpu = tf.placeholder(dtype, shape=(M,M));
    specVarCPU = tf.random_uniform((N,N), dtype=dtype);
    S_cpu = tf.random_uniform((M,N), dtype=dtype);
    V_cpu = tf.matmul(tf.matmul(tf.transpose(tf.transpose(tf.conj(S_cpu))), specVarCPU, ), tf.transpose(S_cpu));
    [D2_cpu, E1_cpu, E2_cpu] = tf.svd(specVarCPU);
    V_cpu = tf.matmul(tf.matmul(tf.transpose(tf.transpose(tf.conj(S_cpu))), E1_cpu), tf.transpose(S_cpu));
print("Timing in ms for %d x %d SVD of type %s and matmul for %d x %d of type %s"%(N, N, dtype, M, N, dtype));
def func(): linalg.svd(svd_array)
benchmark("numpy default SVD", func)
config = tf.ConfigProto(allow_soft_placement = True, graph_options=tf.GraphOptions(optimizer_options=tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)));
sess = tf.Session(config = config);
sess.run(init_OP);
def func2(): sess.run([D2_cpu.op, E1_cpu.op, E2_cpu.op]);
benchmark("TF CPU SVD", func2);
def func3(): sess.run([D2_gpu.op, E1_gpu.op, E2_gpu.op]);
benchmark("TF GPU SVD", func3);
def func1(): np.transpose(np.asmatrix(another_array)).getH().dot(svd_array).dot(np.transpose(another_array));
benchmark("numpy default .dot product", func1)
def func4(): sess.run([V_cpu]);
benchmark("TF CPU matmul", func4)
def func5(): sess.run([V_gpu])
benchmark("TF GPU matmul", func4)
Apparently TensorFlow does not optimize "nested" operations, so tf.matmul(tf.transpose(tf.conj(a)), x) takes significantly longer than b = tf.conj(a), c = tf.transpose(b), and d = tf.matmul(c, x).
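As an illustration of that point (a and x are placeholder names here, not tensors from the benchmark), the split form materializes each intermediate op, and tf.matmul can also fold the conjugate transpose in via its adjoint_a argument; whether either variant helps in a given graph is something profiling has to confirm:

import tensorflow as tf

# Illustrative tensors only; shapes chosen arbitrarily for the sketch.
a = tf.random_uniform((2048, 2048), dtype=tf.float32)
x = tf.random_uniform((2048, 2048), dtype=tf.float32)

# Nested form discussed above:
v_nested = tf.matmul(tf.transpose(tf.conj(a)), x)

# Split form, with each intermediate as its own op:
b = tf.conj(a)
c = tf.transpose(b)
v_split = tf.matmul(c, x)

# Equivalent result using matmul's built-in adjoint (conjugate transpose) flag:
v_adjoint = tf.matmul(a, x, adjoint_a=True)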
For SVD, the problem is that there is no GPU Kernel for SVD yet. See here: https://github.com/tensorflow/tensorflow/issues/11588
This means that SVD has to be computed on the CPU, even if the tensors are instantiated on the GPU. For this reason, there's an overhead for transferring data from the GPU to the CPU for computation, then back to the GPU for storing results.
For matmul on the GPU, the problem is in the last line of your benchmarking code: you are not calling func5 but func4 again, so you are benchmarking the TF CPU matmul.
Aside from this, there are a few other things you may want to check in your code:
there is no need for the init_holder_cpu and init_holder_gpu vars, as you don't use them
there is no need to run the global_variables_initializer, as there are no variables
you are redefining V_cpu, using one of the outputs from SVD, so you are effectively running both SVD and the matmul in your test
A slightly cleaned up version of the code looks like:
# ... above is the same
print("TF version: ", tf.__git_version__)
print("TF url: ", get_tensorflow_version_url())
svd_array = np.random.random_sample((N,N)).astype(dtype)
another_array = np.random.random_sample((M,N)).astype(dtype)
with tf.device("/gpu:0"):
specVarGPU = tf.random_uniform((N, N), dtype=dtype)
S_gpu = tf.random_uniform((M, N), dtype=dtype)
V_gpu = tf.matmul(tf.matmul(tf.transpose(tf.transpose(tf.conj(S_gpu))), specVarGPU, ), tf.transpose(S_gpu))
D2_gpu, E1_gpu, E2_gpu = tf.svd(specVarGPU)
with tf.device("/cpu:0"):
specVarCPU = tf.random_uniform((N,N), dtype=dtype)
S_cpu = tf.random_uniform((M,N), dtype=dtype)
V_cpu = tf.matmul(tf.matmul(tf.transpose(tf.transpose(tf.conj(S_cpu))), specVarCPU, ), tf.transpose(S_cpu))
D2_cpu, E1_cpu, E2_cpu = tf.svd(specVarCPU)
config = tf.ConfigProto(allow_soft_placement = True, graph_options=tf.GraphOptions(optimizer_options=tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)))
def V_numpy():
np.matmul(np.matmul(np.transpose(np.transpose(np.conj(another_array))), svd_array, ), np.transpose(another_array))
with tf.Session(config = config) as sess:
print("Timing in ms for %d x %d SVD of type %s and matmul for %d x %d of type %s"%(N, N, dtype, M, N, dtype))
benchmark("numpy default SVD", lambda: linalg.svd(svd_array))
benchmark("TF CPU SVD", lambda: sess.run([D2_cpu.op, E1_cpu.op, E2_cpu.op]))
benchmark("TF GPU SVD", lambda: sess.run([D2_gpu.op, E1_gpu.op, E2_gpu.op]))
benchmark("numpy MKL matmul", V_numpy)
benchmark("TF CPU matmul", lambda: sess.run([V_cpu.op]))
benchmark("TF GPU matmul", lambda: sess.run([V_gpu.op]))
And the outputs (on an i7 and a GTX 1070):
MKL version b'Intel(R) Math Kernel Library Version 2017.0.3 Product Build 20170413 for Intel(R) 64 architecture applications'
TF version: v1.4.0-rc1-11-g130a514
TF url: https://github.com/tensorflow/tensorflow/commit/130a514
Timing in ms for 2048 x 2048 SVD of type <class 'numpy.float32'> and matmul for 16920 x 2048 of type <class 'numpy.float32'>
numpy default SVD min: 3318.42, median: 3320.40, mean: 3320.40
TF CPU SVD min: 4576.71, median: 4577.02, mean: 4577.02
TF GPU SVD min: 14022.59, median: 14172.69, mean: 14172.69
numpy MKL matmul min: 4500.33, median: 4628.01, mean: 4628.01
TF CPU matmul min: 15420.19, median: 15664.84, mean: 15664.84
TF GPU matmul min: 277.80, median: 282.54, mean: 282.54
You can see that the GPU version of matmul is much faster than any CPU implementation, as expected.

Selecting the device to be used by a graph in TensorFlow

Given a code which uses multiple graphs or multiple versions of the same graph, it is sometimes necessary to ensure that a particular graph uses only CPU for computation, while some other graph uses only GPU.
The basic question is
How to make sure that a particular graph makes use of only CPU (XOR) GPU for computations?
There is not an exhaustive discussion of this topic on SO and hence this question.
I have tried a number of different approaches and none seem to work as will be outlined below.
Before further details on the question and the various options that have been tried, I will lay down the following details:
TensorFlow Version : 'v1.1.0-rc2-1003-g3792dd9' 1.1.0-rc2 (Compiled from source)
OS details : CentOS Linux release 7.2.1511 (Core)
Bazel version : 0.4.5
The basic code with which the various approaches have been tried is given below:
import tensorflow as tf
from tensorflow.python.client import timeline
import matplotlib.pyplot as plt

def coloraugment(image):
    output = tf.image.random_brightness(image, max_delta=10./255.)
    output = tf.clip_by_value(output, 0.0, 1.0)
    output = tf.image.random_saturation(output, lower=0.5, upper=1.5)
    output = tf.clip_by_value(output, 0.0, 1.0)
    output = tf.image.random_contrast(output, lower=0.5, upper=1.5)
    output = tf.clip_by_value(output, 0.0, 1.0)
    return output

def augmentbody(image, sz):
    for i in range(10):
        if i == 0:
            cropped = tf.random_crop(value=image, size=sz)
            croppedflipped = tf.image.flip_left_right(cropped)
            out = tf.stack([cropped, croppedflipped], axis=0)
        else:
            cropimg = tf.random_crop(value=image, size=sz)
            augcolor = coloraugment(cropimg)
            augflipped = tf.image.flip_left_right(augcolor)
            coll = tf.stack([augcolor, augflipped], axis=0)
            out = tf.concat([coll, out], axis=0)
    out = tf.random_shuffle(out)
    return out

def aspect1(aspectratio):
    newheight = tf.constant(256, dtype=tf.float32)
    newwidth = tf.divide(newheight, aspectratio)
    newsize = tf.stack([newheight, newwidth], axis=0)
    newsize = tf.cast(newsize, dtype=tf.int32)
    return newsize

def aspect2(aspectratio):
    newwidth = tf.constant(256, dtype=tf.float32)
    newheight = tf.multiply(newwidth, aspectratio)
    newsize = tf.stack([newheight, newwidth], axis=0)
    newsize = tf.cast(newsize, dtype=tf.int32)
    return newsize

def resize_image(image):
    imageshape = tf.shape(image)
    imageheight = tf.cast(tf.gather(imageshape, tf.constant(0, dtype=tf.int32)),
                          dtype=tf.float32)
    imagewidth = tf.cast(tf.gather(imageshape, tf.constant(1, dtype=tf.int32)),
                         dtype=tf.float32)
    aspectratio = tf.divide(imageheight, imagewidth)
    newsize = tf.cond(tf.less_equal(imageheight, imagewidth),
                      lambda: aspect1(aspectratio),
                      lambda: aspect2(aspectratio))
    image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    image = tf.image.resize_images(image, newsize)
    return image

def readimage(file_queue):
    reader = tf.WholeFileReader()
    key, value = reader.read(file_queue)
    image = tf.image.decode_jpeg(value)
    image = resize_image(image)
    return image

if __name__ == "__main__":
    queue = tf.train.string_input_producer(["holly2.jpg"])
    image = readimage(queue)
    augmented = augmentbody(image, [221,221,3])
    init_op = tf.global_variables_initializer()
    config_cpu = tf.ConfigProto()
    config = tf.ConfigProto(
        device_count = {'GPU': 0}
    )
    sess_cpu = tf.Session(config=config)
    with tf.Session(config=config_cpu) as sess:
        run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
        run_metadata = tf.RunMetadata()
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        sess.run(init_op)
        [tense] = sess.run([augmented], options=run_options, run_metadata=run_metadata)
        coord.request_stop()
        coord.join(threads)
        tl = timeline.Timeline(run_metadata.step_stats)
        ctf = tl.generate_chrome_trace_format()
        with open('timeline.json', 'w') as f:
            f.write(ctf)
        print("The tensor size is {}".format(tense.shape))
        numcols = tense.shape[0]/2
        for i in range(tense.shape[0]):
            plt.subplot(2,numcols,i+1)
            plt.imshow(tense[i, :, :, :])
        plt.show()
        plt.close()
Various approaches which have been tried
Various related questions exist on SO with accepted answers, but they do not seem to work very well, as I outline below with examples and outputs.
Approach 1
The related question is (Run Tensorflow on CPU). The accepted answer is to create the tf.Session() with the following configuration:
config = tf.ConfigProto(
    device_count = {'GPU': 0}
)
sess = tf.Session(config=config)
The corresponding output is :
2017-05-18 13:34:27.477189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7715
pciBusID 0000:04:00.0
Total memory: 7.92GiB
Free memory: 7.80GiB
2017-05-18 13:34:27.477232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:927] DMA: 0
2017-05-18 13:34:27.477240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:937] 0: Y
2017-05-18 13:34:27.477259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
2017-05-18 13:34:27.482600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
2017-05-18 13:34:27.848864: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-05-18 13:34:27.848902: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 40 visible devices
2017-05-18 13:34:27.851670: I tensorflow/compiler/xla/service/service.cc:184] XLA service 0x7f0fd81d5500 executing computations on platform Host. Devices:
2017-05-18 13:34:27.851688: I tensorflow/compiler/xla/service/service.cc:192] StreamExecutor device (0): <undefined>, <undefined>
2017-05-18 13:34:27.851894: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-05-18 13:34:27.851903: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 40 visible devices
2017-05-18 13:34:27.854698: I tensorflow/compiler/xla/service/service.cc:184] XLA service 0x7f0fd82b4c50 executing computations on platform CUDA. Devices:
2017-05-18 13:34:27.854713: I tensorflow/compiler/xla/service/service.cc:192] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2017-05-18 13:34:28.918980: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
You can clearly see that the GPU is still being used and the XLA
service is running on GPU
Approach 2
The related question is (Run Tensorflow on CPU). This answer states that the following environment variable can be set to use the CPU:
CUDA_VISIBLE_DEVICES=""
When the GPU computation is required, it can be unset.
The corresponding output is
2017-05-18 13:42:24.871020: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2017-05-18 13:42:24.871071: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: nefgpu12
2017-05-18 13:42:24.871081: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: nefgpu12
2017-05-18 13:42:24.871114: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 367.48.0
2017-05-18 13:42:24.871147: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.48 Sat Sep 3 18:21:08 PDT 2016
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
"""
2017-05-18 13:42:24.871170: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 367.48.0
2017-05-18 13:42:24.871178: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 367.48.0
2017-05-18 13:42:25.159632: W tensorflow/compiler/xla/service/platform_util.cc:61] platform CUDA present but no visible devices found
2017-05-18 13:42:25.159674: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 40 visible devices
2017-05-18 13:42:25.162626: I tensorflow/compiler/xla/service/service.cc:184] XLA service 0x7f5798002df0 executing computations on platform Host. Devices:
2017-05-18 13:42:25.162663: I tensorflow/compiler/xla/service/service.cc:192] StreamExecutor device (0): <undefined>, <undefined>
2017-05-18 13:42:25.223309: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
You can see from this output that the GPU is not being used.
Approach 3
The related question is (Running multiple graphs in different device modes in TensorFlow). One answer gives the following solution:
# The config for CPU usage
config_cpu = tf.ConfigProto()
config_cpu.gpu_options.visible_device_list=''
sess_cpu = tf.Session(config=config_cpu)
# The config for GPU usage
config_gpu = tf.ConfigProto()
config_gpu.gpu_options.visible_device_list='0'
sess_gpu = tf.Session(config=config_gpu)
The output of using the configuration for CPU usage as outlined in the solution is as follows :
2017-05-18 13:50:32.999431: I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7715
pciBusID 0000:04:00.0
Total memory: 7.92GiB
Free memory: 7.80GiB
2017-05-18 13:50:32.999472: I tensorflow/core/common_runtime/gpu/gpu_device.cc:927] DMA: 0
2017-05-18 13:50:32.999478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:937] 0: Y
2017-05-18 13:50:32.999490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
2017-05-18 13:50:33.084737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
2017-05-18 13:50:33.395798: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-05-18 13:50:33.395837: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 40 visible devices
2017-05-18 13:50:33.398634: I tensorflow/compiler/xla/service/service.cc:184] XLA service 0x7f08181ecfa0 executing computations on platform Host. Devices:
2017-05-18 13:50:33.398695: I tensorflow/compiler/xla/service/service.cc:192] StreamExecutor device (0): <undefined>, <undefined>
2017-05-18 13:50:33.398908: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-05-18 13:50:33.398920: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 40 visible devices
2017-05-18 13:50:33.401731: I tensorflow/compiler/xla/service/service.cc:184] XLA service 0x7f081821e1f0 executing computations on platform CUDA. Devices:
2017-05-18 13:50:33.401745: I tensorflow/compiler/xla/service/service.cc:192] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2017-05-18 13:50:34.484142: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
You can see that the GPU is still being used.
See issues #9201 and #2175. The fact that the GPU devices are created does not mean that your graph is necessarily running on the GPU. You can enforce CPU execution with device_count = {'GPU': 0} or tf.device, but the GPU devices are still created with the session, just in case some op wants them. About CUDA_VISIBLE_DEVICES, making it empty did not work for me either, but doing export CUDA_VISIBLE_DEVICES="-1" (before starting Python, or inside Python through os.environ before importing TensorFlow) did the trick (TensorFlow will output a warning about the GPU not being found, but it will work). You can see the documentation for CUDA_VISIBLE_DEVICES here.
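A minimal sketch of the os.environ route described above (the variable must be set before TensorFlow is imported; the small graph that follows is only illustrative):

# CPU-only process: hide all GPUs before TensorFlow is imported.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf

# Any graph built and run in this process now falls back to the CPU;
# TensorFlow will log a warning that no GPU was found, which is expected.
a = tf.constant([1.0, 2.0, 3.0])
b = tf.constant([4.0, 5.0, 6.0])
c = tf.add(a, b)

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))  # the device placement log should show CPU only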