TF estimator gradient boosted classifier suddenly stopped while training - tensorflow

I trained a gradient boosted classifier with the TF example code at
https://www.tensorflow.org/tutorials/estimators/boosted_trees_model_understanding
but the TF estimator gradient boosted classifier suddenly stopped while training.
It takes several steps at the beginning, then suddenly stops without printing any exception.
How can I find out why Python crashed? It's hard to get the reason why it stopped.
Environment:
lib : TF-gpu 1.13.1
cuda : 10.0
cudnn : 7.5
logs :
2019-04-15 16:40:26.175889: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0
with properties: name: GeForce GTX 1060 6GB major: 6 minor: 1
memoryClockRate(GHz): 1.7845 pciBusID: 0000:07:00.0 totalMemory:
6.00GiB freeMemory: 4.97GiB 2019-04-15 16:40:26.182620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible
gpu devices: 0 2019-04-15 16:40:26.832040: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device
interconnect StreamExecutor with strength 1 edge matrix: 2019-04-15
16:40:26.835620: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-04-15 16:40:26.836840: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-04-15 16:40:26.838276: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created
TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with
4716 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060
6GB, pci bus id: 0000:07:00.0, compute capability: 6.1)
WARNING:tensorflow:From
D:\python\lib\site-packages\tensorflow\python\training\saver.py:1266:
checkpoint_exists (from
tensorflow.python.training.checkpoint_management) is deprecated and
will be removed in a future version. Instructions for updating: Use
standard file APIs to check for files with this prefix.
WARNING:tensorflow:From
D:\python\lib\site-packages\tensorflow\python\training\saver.py:1070:
get_checkpoint_mtimes (from
tensorflow.python.training.checkpoint_management) is deprecated and
will be removed in a future version. Instructions for updating: Use
standard file utilities to get mtimes. WARNING:tensorflow:Issue
encountered when serializing resources. Type is unsupported, or the
types of the items don't match field type in CollectionDef. Note this
is a warning and probably safe to ignore. '_Resource' object has no
attribute 'name' WARNING:tensorflow:Issue encountered when serializing
resources. Type is unsupported, or the types of the items don't match
field type in CollectionDef. Note this is a warning and probably safe
to ignore. '_Resource' object has no attribute 'name'
D:\py> (the process exited back to the prompt here, during training)
trn = pd.read_csv('data/santander-customer-transaction-prediction/train.csv')
tst = pd.read_csv('data/santander-customer-transaction-prediction/test.csv')
#trn = upsample(trn[trn.target==0], trn[trn.target==1])
# trn = downsample(trn[trn.target==0], trn[trn.target==1])
features = trn.columns.values[2:202]
target_name = trn.columns.values[1]
train=trn[features]
target=trn[target_name]
NUM_EXAMPLES = len (target)
print (NUM_EXAMPLES)
feat1 = train.corrwith(target).sort_values().head(20).index
feat2 = train.corrwith(target).sort_values().tail(20).index
featonly = feat1.append(feat2)
feat = featonly.append(pd.Index(['target']))
train_origin, tt = train_test_split(trn, test_size=0.2)
train = train_origin[featonly]
target = train_origin[target_name]
test = tst[featonly]
target_name_tst = tst.columns.values[1]
target_tst=tst[target_name_tst]
val_origin=tt
val_train = tt[featonly]
val_target = tt[target_name]
# Training and evaluation input functions.
train_input_fn = make_input_fn(train, target)
val_input_fn = make_input_fn(val_train, val_target)
ttt=tf.estimator.inputs.pandas_input_fn(x=test,num_epochs=1,shuffle=False)
del train,target,val_train,train_origin,trn,tst
fc = tf.feature_column
feature_columns = []
for feature_name in featonly:
    feature_columns.append(fc.numeric_column(feature_name, dtype=tf.float32))
#feature_columns
#5
#tf.logging.set_verbosity(tf.logging.INFO)
#logging_hook = tf.train.LoggingTensorHook({"loss" : loss, "accuracy" : accuracy}, every_n_iter=10)
params = {
    'n_trees': 50,
    'max_depth': 3,
    'n_batches_per_layer': 1,
    # You must enable center_bias = True to get DFCs. This will force the model to
    # make an initial prediction before using any features (e.g. use the mean of
    # the training labels for regression or log odds for classification when
    # using cross entropy loss).
    'center_bias': True
}
# config = tf.estimator.RunConfig().replace(keep_checkpoint_max = 1,
# log_step_count_steps=20, save_checkpoints_steps=20)
est = tf.estimator.BoostedTreesClassifier(feature_columns, **params,model_dir='d:\py/model/')
est.train(train_input_fn, max_steps=50)
-------------------------------------------stopped
metrics = est.evaluate(input_fn=val_input_fn,steps=1)
results = est.predict(input_fn=ttt )
result_list = list(results)
classi = list(map(lambda x : x['classes'][0].decode("utf-8"), result_list))
num = list(range(0,len(classi)))
numi = list(map(lambda x : 'test_' + str(x),num))
#df1 = pd.DataFrame(columns=('ID_code','target'))
df_result = pd.DataFrame({'ID_code' : numi, 'target' : classi})
df_result.to_csv('result/submission03.csv',index=False)
def make_input_fn(X, y, n_epochs=None, shuffle=True):
    def input_fn():
        NUM_EXAMPLES = len(y)
        dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
        # dataset = tf.data.Dataset.from_tensor_slices((X.to_dict(orient='list'), y))
        #if shuffle:
        #    dataset = dataset.shuffle(NUM_EXAMPLES)
        # For training, cycle thru dataset as many times as need (n_epochs=None).
        dataset = (dataset.repeat(n_epochs).batch(NUM_EXAMPLES))
        return dataset
    return input_fn
The evaluation result should be shown here, but the process stops before it gets that far.

I think the problem is caused by GPU memory overflow.
You can try increasing the value of 'n_batches_per_layer' according to your GPU memory size.
With my 6 GB GPU, a value of 16 worked.
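For example, here is a minimal sketch of that change applied to the params dict from the question (16 is just the value that worked on a 6 GB card; your number may differ):

params = {
    'n_trees': 50,
    'max_depth': 3,
    # Larger values build each tree layer from smaller batches,
    # which lowers peak GPU memory usage.
    'n_batches_per_layer': 16,
    'center_bias': True
}
est = tf.estimator.BoostedTreesClassifier(feature_columns, **params, model_dir='d:\py/model/')
est.train(train_input_fn, max_steps=50)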

Related

using MPI in tensorflow's distributed deeplearning

I am using a cluster that has 8 x 2080 Ti GPUs (11 GB each) for distributed deep learning. My goal is to use all the GPUs for training the model. My code uses MPI to gather all the processes across the cluster and tries to distribute the work across all the workers, but this gives me an error. I am using Python 3.9 and TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1.
I am currently lost and don't know what I need to do in this case. I tried installing other TensorFlow versions using conda, but it ends up the same way.
slurm file :
#!/bin/bash
#SBATCH --job-name=job1 # Job name
#SBATCH --mem=30000 # Job memory request
#SBATCH --gres=gpu:4 # Number of requested GPU(s)
#SBATCH --time=3-23:00:00 # Time limit days-hrs:min:sec
#SBATCH --constraint=rtx_2080 # Specific hardware constraint
#SBATCH --error=slurm.err # Error file name
#SBATCH --output=slurm.out # Output file name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --array=1-2%1
if [ -d "model-final" ]
then
    scancel $SLURM_ARRAY_JOB_ID
else
    module load Anaconda3/2020.07
    module load TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1
    mpirun python -u main.py resume_latest
fi
my error:
Instructions for updating:
use distribute.MultiWorkerMirroredStrategy instead
2023-01-18 13:18:35.789808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9687 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:3d:00.0, compute capability: 7.5
2023-01-18 13:18:35.790848: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9687 MB memory: -> device: 1, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:3e:00.0, compute capability: 7.5
2023-01-18 13:18:35.791743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 9687 MB memory: -> device: 2, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:88:00.0, compute capability: 7.5
2023-01-18 13:18:35.792678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 9687 MB memory: -> device: 3, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:89:00.0, compute capability: 7.5
2023-01-18 13:18:35.804893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:worker/replica:0/task:0/device:GPU:0 with 9687 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:3d:00.0, compute capability: 7.5
2023-01-18 13:18:35.805620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:worker/replica:0/task:0/device:GPU:1 with 9687 MB memory: -> device: 1, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:3e:00.0, compute capability: 7.5
2023-01-18 13:18:35.806333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:worker/replica:0/task:0/device:GPU:2 with 9687 MB memory: -> device: 2, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:88:00.0, compute capability: 7.5
2023-01-18 13:18:35.807029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:worker/replica:0/task:0/device:GPU:3 with 9687 MB memory: -> device: 3, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:89:00.0, compute capability: 7.5
2023-01-18 13:18:35.810512: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> g01:37672}
2023-01-18 13:18:35.810736: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://g01:37672
/usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/keras/optimizer_v2/optimizer_v2.py:355: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
warnings.warn(
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
2023-01-18 13:18:42.547198: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:695] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
key: "Toutput_types"
value {
list {
type: DT_DOUBLE
type: DT_DOUBLE
}
}
}
attr {
key: "output_shapes"
value {
list {
shape {
dim {
size: 15
}
}
shape {
dim {
size: 13
}
}
}
}
}
2023-01-18 13:18:42.740015: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
[g01:44037:0:44313] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 44313) ====
0 0x000000000002137e ucs_debug_print_backtrace() /umbc/ebuild-soft/cascade-lake/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucs/debug/debug.c:656
1 0x000000000382045b tensorflow::NcclCommunicator::Enqueue() collective_communicator.cc:0
2 0x0000000005c9f88a tensorflow::NcclReducer::Run() ???:0
3 0x00000000009086dc tensorflow::BaseCollectiveExecutor::ExecuteAsync(tensorflow::OpKernelContext*, tensorflow::CollectiveParams const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (tensorflow::Status const&)>)::{lambda()#3}::operator()() base_collective_executor.cc:0
4 0x0000000000b99403 tensorflow::UnboundedWorkQueue::PooledThreadFunc() ???:0
5 0x0000000000b9f6b1 tensorflow::(anonymous namespace)::PThread::ThreadFn() env.cc:0
6 0x0000000000007ea5 start_thread() pthread_create.c:0
7 0x00000000000feb0d __clone() ???:0
=================================
[g01:44037] *** Process received signal ***
[g01:44037] Signal: Segmentation fault (11)
[g01:44037] Signal code: (-6)
[g01:44037] Failing at address: 0x2ecf70000ac05
[g01:44037] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2aaaab7e6630]
[g01:44037] [ 1] /usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x382045b)[0x2aaab68fc45b]
[g01:44037] [ 2] /usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow11NcclReducer3RunESt8functionIFvRKNS_6StatusEEE+0x1ca)[0x2aaab8d7b88a]
[g01:44037] [ 3] /usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x9086dc)[0x2aaadc7556dc]
[g01:44037] [ 4] /usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow18UnboundedWorkQueue16PooledThreadFuncEv+0x1b3)[0x2aaadc9e6403]
[g01:44037] [ 5] /usr/ebuild/software/TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0xb9f6b1)[0x2aaadc9ec6b1]
[g01:44037] [ 6] /lib64/libpthread.so.0(+0x7ea5)[0x2aaaab7deea5]
[g01:44037] [ 7] /lib64/libc.so.6(clone+0x6d)[0x2aaaac468b0d]
[g01:44037] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 44037 on node g01 exited on signal 11 (Segmentation fault).
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
# Load in the parameter files
from json import load as loadf
with open("params.json", 'r') as inFile:
    params = loadf(inFile)
# Get data files and prep them for the generator
from tensorflow import distribute as D
callbacks = []
devices = getDevices()
print(devices)
set_tf_config_mpi()
strat = D.experimental.MultiWorkerMirroredStrategy(
    communication=D.experimental.CollectiveCommunication.NCCL)
# Create network
from sys import argv
resume_training = False
print(argv)
if "resume_latest" in argv:
    resume_training = True
with strat.scope():
    # Scheduler
    if isinstance(params["learning_rate"], str):
        # Get the string for the importable function
        lr = params["learning_rate"]
        from tensorflow.keras.callbacks import LearningRateScheduler
        # Use a dummy learning rate
        params["learning_rate"] = 0.1
        # model = create_model(**params)
        # Get the importable function
        lr = lr.split(".")
        baseImport = __import__(lr[0], globals(), locals(), [lr[1]], 0)
        lr = getattr(baseImport, lr[1])
        # Make a schedule
        lr = LearningRateScheduler(lr)
        callbacks.append(lr)
    # Resume Model?
    model_name = None
    if resume_training:
        initial_epoch, model_name = getInitialEpochsAndModelName(rank)
    if model_name is None:
        initial_epoch = 0
        model = create_model(**params)
        resume_training = False
    else:
        from tensorflow.keras.models import load_model
        model = load_model(model_name)
    # Load data from disk
    import numpy
    if "root" in params.keys():
        root = params['root']
    else:
        root = "./"
    if "filename" in params.keys():
        filename = params["filename"]
    else:
        filename = "dataset_timeseries.csv"
    restricted = [
        'euc1', 'e1', 'x1', 'y1', 'z1',
        'euc2', 'e2', 'x2', 'y2', 'z2',
        'euc3', 'e3', 'x3', 'y3', 'z3',
    ]
    x, y = getOneHot("{}/{}".format(root, filename), restricted=restricted, **params)
    # val_x, val_y = getOneHot("{}/{}".format(root, val_filename), restricted=restricted)
    val_x, val_y = None, None
    params["gbatch_size"] = params['batch_size'] * len(devices)
    print("x.shape =", x.shape)
    print("y.shape =", y.shape)
    print("epochs =", params['epochs'], type(params['epochs']))
    print("batch =", params['batch_size'], type(params['batch_size']))
    print("gbatch =", params['gbatch_size'], type(params['gbatch_size']))
    # Load data into a distributed dataset
    # Dataset object does nothing in place:
    # https://stackoverflow.com/questions/55645953/shape-of-tensorflow-dataset-data-in-keras-tensorflow-2-0-is-wrong-after-conver
    from tensorflow.data import Dataset
    data = Dataset.from_tensor_slices((x, y))
    # Create validation set
    v = params['validation']
    if val_x is not None:
        vrecord = val_x.shape[0]
        val = Dataset.from_tensor_slices((val_x, val_y))
        validation = val  # data.take(vrecord)
    else:
        vrecord = int(x.shape[0]*v)
        validation = data.take(vrecord)
    validation = validation.batch(params['gbatch_size'])
    validation = validation.repeat(params['epochs'])
    # Validation -- need to do kfold one day
    # This set should NOT be distributed
    vsteps = vrecord // params['gbatch_size']
    if vrecord % params['gbatch_size'] != 0:
        vsteps += 1
    # Shuffle the data during preprocessing or suffer...
    # Parallel randomness == nightmare
    # data = data.shuffle(x.shape[0])
    # Ordering these two things is very important!
    # Consider 3 elements, batch size 2 repeat 2
    # [1 2 3] -> [[1 2] [3]] -> [[1 2] [3] [1 2] [3]] (correct) batch -> repeat
    # [1 2 3] -> [1 2 3 1 2 3] -> [[1 2] [3 1] [2 3]] (incorrect) repeat -> batch
    # data = data.skip(vrecord)
    data = data.batch(params['gbatch_size'])
    data = data.repeat(params['epochs'])
    records = x.shape[0]  # - vrecord
    steps = records // params['gbatch_size']
    if records % params['gbatch_size']:
        steps += 1
    print("steps =", steps)
    # Note that if we are resuming that the number of _remaining_ epochs has
    # changed!
    # The number of epochs * steps is the numbers of samples to drop
    print("initial cardinality = ", data.cardinality())
    print("initial v cardinality = ", data.cardinality())
    data = data.skip(initial_epoch*steps)
    validation = validation.skip(initial_epoch*vsteps)
    print("final cardinality = ", data.cardinality())
    print("final v cardinality = ", data.cardinality())
    # data = strat.experimental_distribute_dataset(data)
    # Split into validation and training
    callbacks = createCallbacks(params, callbacks, rank, resume_training)
    print(callbacks)
    history = model.fit(data, epochs=params['epochs'],
                        batch_size=params['gbatch_size'],
                        steps_per_epoch=steps,
                        verbose=0,
                        initial_epoch=initial_epoch,
                        validation_data=validation,
                        validation_steps=vsteps,
                        callbacks=callbacks)
    if rank == 0:
        model.save("model-final")
    else:
        model.save("checkpoints/model-tmp")
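(Side note on the deprecation warning at the top of the log, "use distribute.MultiWorkerMirroredStrategy instead": below is a rough, untested sketch of the non-experimental constructor as it looks in TF 2.6; it is not claimed to fix the segmentation fault.)

import tensorflow as tf

# Non-experimental multi-worker strategy with NCCL collectives (TF 2.6 API).
options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)
strat = tf.distribute.MultiWorkerMirroredStrategy(communication_options=options)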

During use gluoncv and mxnet, gpu is not working

Version I used:
python 3.6.5
mxnet 1.5.0
cuda 9.2 (I also installed cuda 11.4 and cudnn 8.2.4 because I checked cmd and my NVIDIA used it)
cudnn 7.6.5
window10 64bit
Question:
I used mxnet and gluoncv for image segmentation, and a GPU problem occurred consistently.
I installed and uninstalled almost every CUDA version (and the matching cuDNNs), but it didn't help.
Also, I'm a little confused: should I use mxnet-cu92 or something else?
When I first installed CUDA 11.4, I installed mxnet-cu101 (mxnet-cu112 didn't work for me),
but then I read that the cu92 build is the one for GPU use, so I reinstalled it together with CUDA 9.2,
and it is still not working.
Here is my code:
ctx = mx.gpu(0)
model = gluoncv.model_zoo.get_model('fcn_resnet50_ade', pretrained=True, ctx=ctx) #deeplab_resnet101_ade #fcn_resnet50_ade
total_df = pd.DataFrame(columns=ADE20KSegmentation.CLASSES)
start = time.time()
Moly = []
Fences = {}
for i in range(len(image_file)):
    if i % 100 == 0:
        print(i)
        print(time.time() - start)
        start = time.time()
    img = mx.image.imread(image_file[i])
    image = test_transform(mx.img.imresize(img, 1200, 1200), ctx)
    output_array = model.predict(image)
    predict_index = mx.nd.argmax(output_array, 1).asnumpy()
    holy = find_fence(predict_index)
    Moly.append(holy)
    flat = predict_index.flatten()
    output_dict = {}
    for index, cls in enumerate(ADE20KSegmentation.CLASSES):
        num_pixel = len(np.where(flat == index)[0])
        output_dict[cls] = round(num_pixel/1440000, 4)
    total_df = total_df.append(output_dict, ignore_index=True)
for names, holy in zip(image_names, Moly):
    Fences[names] = holy
and I get "MXNetError: C:\Jenkins\workspace\mxnet-tag\mxnet\src\ndarray\ndarray.cc:1285: GPU is not enabled" on this line:
model = gluoncv.model_zoo.get_model('fcn_resnet50_ade', pretrained=True, ctx=ctx)
What should I do now?

How to automatically assign free GPUs in TensorFlow

I have 4 Tesla K80 GPUs in my system. I would like to automatically allocate free GPUs based on an integer input in the code. I am aware of tf.config.experimental.set_visible_devices() for assigning specific GPUs, but I currently do not know how to identify which GPUs are in use (except manually, using nvidia-smi). I am currently changing the code below for every run.
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only use the first GPU
    try:
        tf.config.experimental.set_visible_devices(gpus[2:], 'GPU')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)
The above code lets me set the GPUs I want to allocate (GPUs 2 and 3 in the example above) for the run. Is there any way to obtain a list of free (unused) devices to automate the allocation process, instead of manually identifying which devices should be set?
I am currently using TensorFlow version 1.15.
import subprocess, re
import os
import utils

# Nvidia-smi GPU memory parsing.
# Tested on nvidia-smi 370.23
# TF1.15

def run_command(cmd):
    """Run command, return output as string."""
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True).communicate()[0]
    return output.decode("ascii")

def list_available_gpus():
    """Returns list of available GPU ids."""
    output = run_command("nvidia-smi -L")
    # lines of the form GPU 0: TITAN X
    gpu_regex = re.compile(r"GPU (?P<gpu_id>\d+):")
    result = []
    for line in output.strip().split("\n"):
        m = gpu_regex.match(line)
        assert m, "Couldnt parse " + line
        result.append(int(m.group("gpu_id")))
    return result

def gpu_memory_map():
    """Returns map of GPU id to memory allocated on that GPU."""
    output = run_command("nvidia-smi")
    gpu_output = output[output.find("GPU Memory"):]
    # lines of the form
    # | 0 8734 C python 11705MiB |
    memory_regex = re.compile(r"[|]\s+?(?P<gpu_id>\d+)\D+?(?P<pid>\d+).+[ ](?P<gpu_memory>\d+)MiB")
    rows = gpu_output.split("\n")
    result = {gpu_id: 0 for gpu_id in list_available_gpus()}
    for row in gpu_output.split("\n"):
        m = memory_regex.search(row)
        if not m:
            continue
        gpu_id = int(m.group("gpu_id"))
        gpu_memory = int(m.group("gpu_memory"))
        result[gpu_id] += gpu_memory
    return result

def pick_gpu_lowest_memory():
    """Returns GPU with the least allocated memory"""
    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
    best_memory, best_gpu = sorted(memory_gpu_map)[0]
    return best_gpu

def pick_free_gpus(num_gpus=1):
    """Returns free GPUs with the least allocated memory"""
    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
    sorted_list = sorted(memory_gpu_map)
    gpu_list = []
    for i in range(num_gpus):
        if sorted_list[i][0] == 0:
            gpu_list.append(sorted_list[i][1])
        else:
            print(f'Currently fewer than {num_gpus} GPUs are free right now, choose {i} or fewer GPUs')
            exit()
    return ','.join(map(str, gpu_list))

num_gpus = 2
os.environ["CUDA_VISIBLE_DEVICES"] = pick_free_gpus(num_gpus)
import tensorflow as tf
tf.config.optimizer.set_jit(True)  # Enable XLA.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only use the first GPU
    try:
        tf.config.experimental.set_visible_devices(gpus, 'GPU')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)
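As a usage note, pick_gpu_lowest_memory() above can be used the same way when only one device is needed (a small sketch under the same assumptions as the code above; it must run before TensorFlow is imported):

# Expose only the single GPU with the least allocated memory.
os.environ["CUDA_VISIBLE_DEVICES"] = str(pick_gpu_lowest_memory())
import tensorflow as tf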

use estimator got InvalidArgumentError Cannot assign a device for operation and allow_soft_placement: true doesn't work

def create_hparams():
    return trainer_lib.create_hparams(
        FLAGS.hparams_set,
        FLAGS.hparams,
        data_dir=os.path.expanduser(FLAGS.data_dir),
        problem_name=FLAGS.problem)

def create_decode_hparams():
    decode_hp = decoding.decode_hparams(FLAGS.decode_hparams)
    decode_hp.shards = FLAGS.decode_shards
    decode_hp.shard_id = FLAGS.worker_id
    decode_in_memory = FLAGS.decode_in_memory or decode_hp.decode_in_memory
    decode_hp.decode_in_memory = decode_in_memory
    decode_hp.decode_to_file = FLAGS.decode_to_file
    decode_hp.decode_reference = FLAGS.decode_reference
    return decode_hp

hp = create_hparams()
decode_hp = create_decode_hparams()
run_conf = t2t_trainer.create_run_config(hp)
estimator = trainer_lib.create_estimator(
    FLAGS.model,
    hp,
    run_conf,
    decode_hparams=decode_hp,
    use_tpu=FLAGS.use_tpu)
print(run_conf.session_config)

def input_fn():
    inputs = tf.placeholder(tf.int32, shape=(1, None, 1, 1), name="inputs")
    input_tensor = {'inputs': inputs}
    return tf.estimator.export.ServingInputReceiver(input_tensor, input_tensor)

predictor = tf.contrib.predictor.from_estimator(estimator, input_fn)
I get the following output:
InvalidArgumentError: Cannot assign a device for operation
transformer/body/parallel_0/body/encoder/layer_0/self_attention/multihead_attention/dot_product_attention/attention: Could not satisfy explicit device specification '/device:GPU:0'
because no supported kernel for GPU devices is available. Colocation
Debug Info: Colocation group had the following types and supported
devices: Root Member(assigned_device_name_index_=-1
requested_device_name_='/device:GPU:0' assigned_device_name_=''
resource_device_name_='' supported_device_types_=[CPU]
possible_devices_=[] ImageSummary: CPU
Colocation members, user-requested devices, and framework assigned
devices, if any:
transformer/body/parallel_0/body/encoder/layer_0/self_attention/multihead_attention/dot_product_attention/attention
(ImageSummary) /device:GPU:0
Op: ImageSummary Node attrs: max_images=1, T=DT_FLOAT,
bad_color=Tensor Registered
kernels: device='CPU'
When I print run_conf.session_config, I get allow_soft_placement: true. Many people say this solves the InvalidArgumentError, but it does not seem to work for me.
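(For context, below is a minimal TF 1.x sketch of what allow_soft_placement is supposed to do when an op has only a CPU kernel, like the ImageSummary op in the error above. It is only an illustration of the setting, not a fix for the tensor2tensor code.)

import tensorflow as tf

# Image summaries only have a CPU kernel; with allow_soft_placement=True the
# explicit GPU request below should silently fall back to the CPU.
with tf.device('/device:GPU:0'):
    img = tf.zeros([1, 8, 8, 3])
    summary_op = tf.summary.image("img", img, max_outputs=1)

config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(summary_op)  # the summary op ends up placed on the CPU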

Selecting the device to be used by a graph in TensorFlow

Given code that uses multiple graphs, or multiple versions of the same graph, it is sometimes necessary to ensure that a particular graph uses only the CPU for computation while some other graph uses only the GPU.
The basic question is:
How can I make sure that a particular graph uses only the CPU (XOR) the GPU for computations?
There is no exhaustive discussion of this topic on SO, hence this question.
I have tried a number of different approaches and none seems to work, as outlined below.
Before going into further details on the question and the various options that have been tried, here are the relevant details:
TensorFlow Version : 'v1.1.0-rc2-1003-g3792dd9' 1.1.0-rc2 (Compiled from source)
OS details : CentOS Linux release 7.2.1511 (Core)
Bazel version : 0.4.5
The basic code with which the various approaches have been tried is given below:
import tensorflow as tf
from tensorflow.python.client import timeline
import matplotlib.pyplot as plt

def coloraugment(image):
    output = tf.image.random_brightness(image, max_delta=10./255.)
    output = tf.clip_by_value(output, 0.0, 1.0)
    output = tf.image.random_saturation(output, lower=0.5, upper=1.5)
    output = tf.clip_by_value(output, 0.0, 1.0)
    output = tf.image.random_contrast(output, lower=0.5, upper=1.5)
    output = tf.clip_by_value(output, 0.0, 1.0)
    return output

def augmentbody(image, sz):
    for i in range(10):
        if i == 0:
            cropped = tf.random_crop(value=image, size=sz)
            croppedflipped = tf.image.flip_left_right(cropped)
            out = tf.stack([cropped, croppedflipped], axis=0)
        else:
            cropimg = tf.random_crop(value=image, size=sz)
            augcolor = coloraugment(cropimg)
            augflipped = tf.image.flip_left_right(augcolor)
            coll = tf.stack([augcolor, augflipped], axis=0)
            out = tf.concat([coll, out], axis=0)
    out = tf.random_shuffle(out)
    return out

def aspect1(aspectratio):
    newheight = tf.constant(256, dtype=tf.float32)
    newwidth = tf.divide(newheight, aspectratio)
    newsize = tf.stack([newheight, newwidth], axis=0)
    newsize = tf.cast(newsize, dtype=tf.int32)
    return newsize

def aspect2(aspectratio):
    newwidth = tf.constant(256, dtype=tf.float32)
    newheight = tf.multiply(newwidth, aspectratio)
    newsize = tf.stack([newheight, newwidth], axis=0)
    newsize = tf.cast(newsize, dtype=tf.int32)
    return newsize

def resize_image(image):
    imageshape = tf.shape(image)
    imageheight = tf.cast(tf.gather(imageshape, tf.constant(0, dtype=tf.int32)),
                          dtype=tf.float32)
    imagewidth = tf.cast(tf.gather(imageshape, tf.constant(1, dtype=tf.int32)),
                         dtype=tf.float32)
    aspectratio = tf.divide(imageheight, imagewidth)
    newsize = tf.cond(tf.less_equal(imageheight, imagewidth),
                      lambda: aspect1(aspectratio),
                      lambda: aspect2(aspectratio))
    image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    image = tf.image.resize_images(image, newsize)
    return image

def readimage(file_queue):
    reader = tf.WholeFileReader()
    key, value = reader.read(file_queue)
    image = tf.image.decode_jpeg(value)
    image = resize_image(image)
    return image

if __name__ == "__main__":
    queue = tf.train.string_input_producer(["holly2.jpg"])
    image = readimage(queue)
    augmented = augmentbody(image, [221, 221, 3])
    init_op = tf.global_variables_initializer()
    config_cpu = tf.ConfigProto()
    config = tf.ConfigProto(
        device_count = {'GPU': 0}
    )
    sess_cpu = tf.Session(config=config)
    with tf.Session(config=config_cpu) as sess:
        run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
        run_metadata = tf.RunMetadata()
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        sess.run(init_op)
        [tense] = sess.run([augmented], options=run_options, run_metadata=run_metadata)
        coord.request_stop()
        coord.join(threads)
        tl = timeline.Timeline(run_metadata.step_stats)
        ctf = tl.generate_chrome_trace_format()
        with open('timeline.json', 'w') as f:
            f.write(ctf)
    print("The tensor size is {}".format(tense.shape))
    numcols = tense.shape[0]/2
    for i in range(tense.shape[0]):
        plt.subplot(2, numcols, i+1)
        plt.imshow(tense[i, :, :, :])
    plt.show()
    plt.close()
Various approaches which have been tried
Several related questions on SO have accepted answers, but they do not seem to work very well, as I outline below with examples and outputs.
Approach 1
The related question is (Run Tensorflow on CPU). The accepted answer is to run tf.Session() with the following configuration:
config = tf.ConfigProto(
device_count = {'GPU': 0}
)
sess = tf.Session(config=config)
The corresponding output is:
2017-05-18 13:34:27.477189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7715
pciBusID 0000:04:00.0
Total memory: 7.92GiB
Free memory: 7.80GiB
2017-05-18 13:34:27.477232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:927] DMA: 0
2017-05-18 13:34:27.477240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:937] 0: Y
2017-05-18 13:34:27.477259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
2017-05-18 13:34:27.482600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
2017-05-18 13:34:27.848864: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-05-18 13:34:27.848902: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 40 visible devices
2017-05-18 13:34:27.851670: I tensorflow/compiler/xla/service/service.cc:184] XLA service 0x7f0fd81d5500 executing computations on platform Host. Devices:
2017-05-18 13:34:27.851688: I tensorflow/compiler/xla/service/service.cc:192] StreamExecutor device (0): <undefined>, <undefined>
2017-05-18 13:34:27.851894: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-05-18 13:34:27.851903: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 40 visible devices
2017-05-18 13:34:27.854698: I tensorflow/compiler/xla/service/service.cc:184] XLA service 0x7f0fd82b4c50 executing computations on platform CUDA. Devices:
2017-05-18 13:34:27.854713: I tensorflow/compiler/xla/service/service.cc:192] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2017-05-18 13:34:28.918980: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
You can clearly see that the GPU is still being used and that the XLA service is running on the GPU.
Approach 2
The related question is (Run Tensorflow on CPU). That answer states that the following environment variable can be set to force CPU use:
CUDA_VISIBLE_DEVICES=""
When the GPU computation is required, it can be unset.
The corresponding output is:
2017-05-18 13:42:24.871020: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2017-05-18 13:42:24.871071: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: nefgpu12
2017-05-18 13:42:24.871081: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: nefgpu12
2017-05-18 13:42:24.871114: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 367.48.0
2017-05-18 13:42:24.871147: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.48 Sat Sep 3 18:21:08 PDT 2016
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
"""
2017-05-18 13:42:24.871170: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 367.48.0
2017-05-18 13:42:24.871178: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 367.48.0
2017-05-18 13:42:25.159632: W tensorflow/compiler/xla/service/platform_util.cc:61] platform CUDA present but no visible devices found
2017-05-18 13:42:25.159674: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 40 visible devices
2017-05-18 13:42:25.162626: I tensorflow/compiler/xla/service/service.cc:184] XLA service 0x7f5798002df0 executing computations on platform Host. Devices:
2017-05-18 13:42:25.162663: I tensorflow/compiler/xla/service/service.cc:192] StreamExecutor device (0): <undefined>, <undefined>
2017-05-18 13:42:25.223309: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
You can see from this output that the GPU is not being used.
Approach 3
The related question is (Running multiple graphs in different device modes in TensorFlow). One answer gives the following solution:
# The config for CPU usage
config_cpu = tf.ConfigProto()
config_cpu.gpu_options.visible_device_list=''
sess_cpu = tf.Session(config=config_cpu)
# The config for GPU usage
config_gpu = tf.ConfigProto()
config_gpu.gpu_options.visible_device_list='0'
sess_gpu = tf.Session(config=config_gpu)
The output of using the configuration for CPU usage, as outlined in that solution, is as follows:
2017-05-18 13:50:32.999431: I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7715
pciBusID 0000:04:00.0
Total memory: 7.92GiB
Free memory: 7.80GiB
2017-05-18 13:50:32.999472: I tensorflow/core/common_runtime/gpu/gpu_device.cc:927] DMA: 0
2017-05-18 13:50:32.999478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:937] 0: Y
2017-05-18 13:50:32.999490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
2017-05-18 13:50:33.084737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
2017-05-18 13:50:33.395798: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-05-18 13:50:33.395837: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 40 visible devices
2017-05-18 13:50:33.398634: I tensorflow/compiler/xla/service/service.cc:184] XLA service 0x7f08181ecfa0 executing computations on platform Host. Devices:
2017-05-18 13:50:33.398695: I tensorflow/compiler/xla/service/service.cc:192] StreamExecutor device (0): <undefined>, <undefined>
2017-05-18 13:50:33.398908: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-05-18 13:50:33.398920: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 40 visible devices
2017-05-18 13:50:33.401731: I tensorflow/compiler/xla/service/service.cc:184] XLA service 0x7f081821e1f0 executing computations on platform CUDA. Devices:
2017-05-18 13:50:33.401745: I tensorflow/compiler/xla/service/service.cc:192] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2017-05-18 13:50:34.484142: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
You can see that the GPU is still being used.
See issues #9201 and #2175. The fact that the GPU devices are created does not mean that your graph necessarily runs on the GPU. You can enforce CPU execution with device_count = {'GPU': 0} or with tf.device, but the GPU devices are still created along with the session, just in case some op wants them.
About CUDA_VISIBLE_DEVICES: setting it to an empty string did not work for me either, but doing export CUDA_VISIBLE_DEVICES="-1" (before starting Python, or inside Python through os.environ before importing TensorFlow) did the trick (TensorFlow will output a warning about the GPU not being found, but it will work). You can see the documentation for CUDA_VISIBLE_DEVICES here.
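A minimal sketch of the os.environ variant described above (the variable has to be set before TensorFlow is imported for the first time):

import os

# Hide all GPUs from TensorFlow; must run before `import tensorflow`.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf

# Alternatively, pin a specific subgraph to the CPU explicitly:
with tf.device('/cpu:0'):
    a = tf.constant([1.0, 2.0])
    b = a * 2

with tf.Session() as sess:
    print(sess.run(b))  # runs on the CPU; TF only warns that no GPU was found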