I added a new GPU kernel op (MyGpuOp), then used the op in Python and specified the device with tf.device('/gpu:0').
example:
with tf.device('/gpu:0'):
    out1 = my_gpu_op(input1)
with tf.device('/gpu:1'):
    out2 = my_gpu_op(input2)
Then I call sess.run([out1, out2]), but the ops do not seem to run concurrently on the two GPU devices, because the run time of sess.run([out1, out2]) is twice that of sess.run(out1).
In the C++ op wrapper (MyGpuOp) I launch the CUDA kernel by passing the stream (cudaStream_t), like this:
// ctx is the OpKernelContext*
const GPUDevice& d = ctx->eigen_device<GPUDevice>();
cudaStream_t stream = d.stream();
MyKernelName<<<grid, block, 0, stream>>>(....);
Do I need to get the device id and call cudaSetDevice(device_id) before launching the kernel?
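To make the timing comparison concrete, here is a rough sketch of the kind of measurement meant above (my_gpu_op, input1 and input2 are placeholders from the snippet; a warm-up run is added so graph setup and memory allocation are not timed):

import time
import tensorflow as tf

with tf.device('/gpu:0'):
    out1 = my_gpu_op(input1)
with tf.device('/gpu:1'):
    out2 = my_gpu_op(input2)

with tf.Session() as sess:
    sess.run([out1, out2])                # warm-up: compile kernels, allocate memory

    t0 = time.time()
    sess.run(out1)
    print('one GPU :', time.time() - t0)

    t0 = time.time()
    sess.run([out1, out2])                # expected to take about the same wall time
    print('two GPUs:', time.time() - t0)  # if the two devices really run concurrently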
Related
I’m running an object detection routine on a server.
I set the context to the GPU, and I'm loading the model, the parameters, and the data on the GPU. The program reads from a video file or from an RTSP stream using OpenCV.
When using nvidia-smi, I see that the selected GPU's usage is at about 20%, which is reasonable. However, the object detection routine is still using 750-1200% of the CPU (basically, all of the available cores of the server).
This is the code:
def main():
    ctx = mx.gpu(3)
    # -------------------------
    # Load a pretrained model
    # -------------------------
    net = gcv.model_zoo.get_model('ssd_512_mobilenet1.0_coco', pretrained=True)
    # Load the webcam handler
    cap = cv2.VideoCapture("video/video_01.mp4")
    count_frame = 0

    while(True):
        print(f"Frame: {count_frame}")

        # Load frame from the camera
        ret, frame = cap.read()
        if (cv2.waitKey(25) & 0xFF == ord('q')) or (ret == False):
            cv2.destroyAllWindows()
            cap.release()
            print("Done!!!")
            break

        # Image pre-processing
        frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
        frame_nd, frame_np = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)
        if isinstance(frame_nd, mx.ndarray.ndarray.NDArray):
            frame_nd.wait_to_read()

        # Run frame through network
        frame_nd = frame_nd.as_in_context(ctx)
        class_IDs, scores, bounding_boxes = net(frame_nd)
        if isinstance(class_IDs, mx.ndarray.ndarray.NDArray):
            class_IDs.wait_to_read()
        if isinstance(scores, mx.ndarray.ndarray.NDArray):
            scores.wait_to_read()
        if isinstance(bounding_boxes, mx.ndarray.ndarray.NDArray):
            bounding_boxes.wait_to_read()

        count_frame += 1

    cv2.destroyAllWindows()
    cap.release()
This is the output of nvidia-smi, and this is the output of top (screenshots omitted).
The pre-processing operations are running on the CPU:
frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
frame_nd, frame_np = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)
but is that enough to justify such high CPU usage? If so, can I run them on the GPU as well?
EDIT: I modified and copied the whole code, in response to Olivier_Cruchant's comment (thanks!)
Your CPU is likely busy because of the pre-processing load and the frequent back-and-forth from memory to GPU, since inference seems to be running frame by frame.
I would suggest trying the following:
- Run batched inference (send a batch of N frames to the network) to increase GPU usage and reduce communication; see the sketch after this list.
- Try using NVIDIA DALI to make better use of the GPU for data ingestion and pre-processing (DALI MXNet reference, DALI mp4 ingestion PyTorch example).
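As an illustration of the first suggestion, here is a minimal, untested sketch of batching frames before inference. The frame reading and transform_test call are taken from the question; the batch size, the ctx argument to get_model, and the mx.nd.concat call are assumptions, and all frames are assumed to have the same resolution so they can be stacked along the batch axis:

import cv2
import mxnet as mx
import gluoncv as gcv

BATCH_SIZE = 8                      # assumed value; tune for your GPU memory
ctx = mx.gpu(3)
net = gcv.model_zoo.get_model('ssd_512_mobilenet1.0_coco', pretrained=True, ctx=ctx)
cap = cv2.VideoCapture("video/video_01.mp4")

batch = []
while True:
    ret, frame = cap.read()
    if not ret:
        break
    frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
    frame_nd, _ = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)
    batch.append(frame_nd)
    if len(batch) == BATCH_SIZE:
        # One device transfer and one forward pass for the whole batch
        batch_nd = mx.nd.concat(*batch, dim=0).as_in_context(ctx)
        class_IDs, scores, bounding_boxes = net(batch_nd)
        mx.nd.waitall()             # block until the GPU work is done
        batch = []                  # (leftover frames at EOF are ignored in this sketch)
cap.release()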
I have several GPUs but I only want to use one GPU for my training. I am using following options:
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
Despite setting/using all these options, all of my GPUs allocate memory, and the number of processes equals the number of GPUs.
How can I prevent this from happening?
Note
I do not want to set the devices manually, and I do not want to set CUDA_VISIBLE_DEVICES, since I want TensorFlow to automatically find the best (idle) GPU available.
When I try to start another run, it uses the same GPU that is already used by another TensorFlow process, even though there are several other free GPUs (apart from the memory allocated on them).
I am running tensorflow in a docker container: tensorflow/tensorflow:latest-devel-gpu-py
I had this problem myself. Setting config.gpu_options.allow_growth = True did not do the trick; all of the GPU memory was still consumed by TensorFlow.
The way around it is the undocumented environment variable TF_FORCE_GPU_ALLOW_GROWTH (I found it in
https://github.com/tensorflow/tensorflow/blob/3e21fe5faedab3a8258d344c8ad1cec2612a8aa8/tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc#L25)
Setting TF_FORCE_GPU_ALLOW_GROWTH=true works perfectly.
In the Python code, you can set
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
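A minimal sketch (assuming the variable must be set before TensorFlow initializes its GPU devices, so it is placed before the import):

import os
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'   # set before TensorFlow touches the GPU

import tensorflow as tf

with tf.Session() as sess:
    ...   # GPU memory now grows on demand instead of being fully pre-allocated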
I can offer you a method mask_busy_gpus defined here: https://github.com/yselivonchyk/TensorFlow_DCIGN/blob/master/utils.py
Simplified version of the function:
import subprocess as sp
import os

def mask_unused_gpus(leave_unmasked=1):
    ACCEPTABLE_AVAILABLE_MEMORY = 1024  # MiB
    COMMAND = "nvidia-smi --query-gpu=memory.free --format=csv"

    try:
        _output_to_list = lambda x: x.decode('ascii').split('\n')[:-1]
        memory_free_info = _output_to_list(sp.check_output(COMMAND.split()))[1:]
        memory_free_values = [int(x.split()[0]) for x in memory_free_info]
        available_gpus = [i for i, x in enumerate(memory_free_values) if x > ACCEPTABLE_AVAILABLE_MEMORY]

        if len(available_gpus) < leave_unmasked:
            raise ValueError('Found only %d usable GPUs in the system' % len(available_gpus))
        os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(map(str, available_gpus[:leave_unmasked]))
    except Exception as e:
        print('"nvidia-smi" is probably not installed. GPUs are not masked', e)
Usage:
mask_unused_gpus()
with tf.Session()...
Prerequisites: nvidia-smi
With this script I was solving the following problem: on a multi-GPU cluster, use only a single (or an arbitrary) number of GPUs and have them allocated automatically.
Shortcoming of the script: if you start multiple scripts at once, they may end up on the same GPU, because the script relies on the reported memory allocation, and allocation takes a few seconds to kick in.
I am running a large distributed Tensorflow model in google cloud ML engine. I want to use machines with GPUs.
My graph consists of two main parts: the input/data reader function and the computation part.
I wish to place the variables in the PS task, the input part on the CPU, and the computation part on the GPU.
The function tf.train.replica_device_setter automatically places variables on the PS server.
This is what my code looks like:
with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    input_tensors = model.input_fn(...)
    output_tensors = model.model_fn(input_tensors, ...)
Is it possible to use tf.device() together with replica_device_setter() as in:
with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    with tf.device('/cpu:0'):
        input_tensors = model.input_fn(...)
    with tf.device('/gpu:0'):
        tensor_dict = model.model_fn(input_tensors, ...)
Will the replica_device_setter() be overridden, so that the variables are not placed on the PS server?
Furthermore, since the device names in the cluster are something like job:master/replica:0/task:0/gpu:0, how do I tell TensorFlow which device I mean in tf.device(whatever/gpu:0)?
Any operations, other than variables, created in the tf.train.replica_device_setter block are automatically pinned to "/job:worker", which defaults to the first device managed by the first task in the "worker" job.
You can pin them to another device (or task) by using an embedded device block:
with tf.device(tf.train.replica_device_setter(ps_tasks=2, ps_device="/job:ps",
                                              worker_device="/job:worker")):
    v1 = tf.Variable(1., name="v1")  # pinned to /job:ps/task:0 (defaults to /cpu:0)
    v2 = tf.Variable(2., name="v2")  # pinned to /job:ps/task:1 (defaults to /cpu:0)
    v3 = tf.Variable(3., name="v3")  # pinned to /job:ps/task:0 (defaults to /cpu:0)
    s = v1 + v2                      # pinned to /job:worker (defaults to task:0/cpu:0)
    with tf.device("/task:1"):
        p1 = 2 * s                   # pinned to /job:worker/task:1 (defaults to /cpu:0)
        with tf.device("/cpu:0"):
            p2 = 3 * s               # pinned to /job:worker/task:1/cpu:0
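One way to check where everything actually ends up (a generic sketch, not part of the answer above) is to enable device placement logging when creating the session; in a distributed job, pass the server target to tf.Session as well:

# Logs the device assigned to every op, so the PS/worker and CPU/GPU split can be verified
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())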
My desktop has two GPUs which can run TensorFlow with the specification /gpu:0 or /gpu:1. However, if I don't specify which GPU to run the code on, TensorFlow will by default use /gpu:0, as we all know.
Now I would like to set up the system so that it assigns a GPU dynamically according to the free memory of each GPU. For example, if a script doesn't specify which GPU to run the code on, the system first assigns /gpu:0 to it; then, if another script runs, it will check whether /gpu:0 has enough free memory. If yes, it will continue to assign /gpu:0; otherwise it will assign /gpu:1. How can I achieve this?
Follow-ups:
I believe the question above may be related to GPU virtualization. That is to say, if I could virtualize the multiple GPUs in a desktop into one GPU, I would get what I want. So besides any setup methods for TensorFlow, any ideas about virtualization are also welcome.
TensorFlow generally assumes it's not sharing the GPU with anyone, so I don't see a way of doing it from inside TensorFlow. However, you could do it from the outside as follows: a shell script that calls nvidia-smi, parses out the GPU k with the most free memory, then sets CUDA_VISIBLE_DEVICES=k and calls the TensorFlow script.
Inspired by:
How to set specific gpu in tensorflow?
import os
import subprocess as sp

def leave_gpu_with_most_free_ram():
    try:
        command = "nvidia-smi --query-gpu=memory.free --format=csv"
        _output_to_list = lambda x: x.decode('ascii').split('\n')[:-1]
        memory_free_info = _output_to_list(sp.check_output(command.split()))[1:]
        memory_free_values = [int(x.split()[0]) for x in memory_free_info]
        least_busy_idx = memory_free_values.index(max(memory_free_values))

        # update CUDA variable so only the least busy GPU remains visible
        setting = str(least_busy_idx)
        os.environ["CUDA_VISIBLE_DEVICES"] = setting
        print('Left GPU %s unmasked (of %d GPUs found)' % (setting, len(memory_free_values)))
    except FileNotFoundError as e:
        print('"nvidia-smi" is probably not installed. GPUs are not masked')
        print(e)
    except sp.CalledProcessError as e:
        print("Error on GPU masking:\n", e.output)
Add a call to this function before importing tensorflow.
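For example, a minimal sketch of the intended usage:

leave_gpu_with_most_free_ram()   # must run before TensorFlow initializes CUDA

import tensorflow as tf          # TensorFlow now only sees the selected GPU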
In tensorflow I register an op like so:
REGISTER_OP("RimeBSqrt")
    .Input("stokes: FT")
    .Input("alpha: FT")
    .Input("frequency: FT")
    .Input("ref_freq: FT")
    .Output("b_sqrt: CT")
    .Attr("FT: {float, double} = DT_FLOAT")
    .Attr("CT: {complex64, complex128} = DT_COMPLEX64");
All of the above inputs are tensors, but ref_freq is a scalar or 0-D tensor. In the Compute() method of my CPU kernel I can do the following to extract the scalar:
const Tensor & in_ref_freq = context->input(3);
FT ref_freq = in_ref_freq.tensor<FT, 1>()(0);
However, the same kind of code generates a segfault in the Compute() method of my GPU kernel, because the CPU now tries to access a block of memory on the GPU device. Is there any way to intercept this scalar value before sending it to the GPU? I'd like to avoid the following extra level of memory indirection in a CUDA kernel:
template <typename FT>
__global__ void kernel(..., FT * ref_freq, ...)
{
    FT value = ref_freq[0];
}
I don't think an Attr is the right approach for ref_freq, since it is a changeable, configurable value.
CPU Tensorflow kernel code is here.
GPU Tensorflow kernel code is here.
Python variable setup code is here.
You can specify that one or more of the inputs to (or outputs from) a TensorFlow OpKernel are in "host memory", which allows you to access the value in the Compute() method. To do this you would modify your REGISTER_KERNEL_BUILDER() call to add a .HostMemory("ref_freq") instruction:
REGISTER_KERNEL_BUILDER(
    Name("RimeBSqrt")
    .Device(tensorflow::DEVICE_GPU)
    .TypeConstraint<float>("FT")
    .TypeConstraint<tensorflow::complex64>("CT")
    .HostMemory("ref_freq"),
    RimeBSqrt<tensorflow::GPUDevice, float, tensorflow::complex64>);