CUDA programming: Is occupancy the way to achieve GPU slicing among different processes? - tensorflow

There are ways through which GPU sharing can be achieved. I came across occupancy. Can I use it to slice the GPU among the processes (e.g. TensorFlow) that are sharing the GPU? "Slice" here means that GPU resources are always dedicated to that process. Using occupancy I would get to know the GPU and SM details, and based on that I would launch kernels specifying which blocks to create on those GPU resources.
I am using an NVIDIA Corporation GK210GL [Tesla K80] with the CUDA 9 toolkit installed.
Please suggest. Thanks!

There are ways through which GPU sharing can be achieved.
No, there aren't. In general, there is no such thing as the type of GPU sharing you envisage. There is the MPS server for MPI-style multi-process computing, but that is irrelevant in the context of running TensorFlow (see here for why MPS can't be used).
I came across occupancy. Can I use it to slice the GPU among the processes (e.g. TensorFlow) that are sharing the GPU?
No, you can't. Occupancy is a performance metric. It has nothing whatsoever to do with the ability to share a GPU's resources amongst different processes.
Please suggest
Buy a second GPU.

Related

Since TensorflowJS can use the GPU via WebGL, why would I need an nVIDIA GPU?

So TensorFlowJS can use WebGL to do GPU computations and train deep learning models. Why isn't this more popular than using CUDA with an nVIDIA GPU? Most people just trying to prototype machine learning models would love to do so on their personal computer, but many of us resort to using expensive cloud services like AWS (although more recently Google Colab helps) for ML training if we don't have a computer with an nVIDIA GPU. I'm sure nVIDIA GPUs are faster than whatever GPU is in my Macbook, but probably any GPU will offer at least an order of magnitude speedup over even a fast CPU and allow for model prototyping, so why aren't we all using WebGL GPGPU? There must be a catch I just don't know about.
The WebGL backend uses the GLSL language to define functions and uploads data as shaders - it "works", but you pay a huge cost to compile the GLSL and upload the shaders: warmup time for semi-complex models is immense (we're talking minutes just to start up). And then the memory overhead is 100-200% of what the model would normally need - and for larger models you're GPU memory bound, so you don't want to waste that.
By the way, actual inference time once the model is warmed up and fits in memory is OK using WebGL.
On the other hand, nVidia's CUDA libraries provide direct access to the GPU, so TF compiled to use them is always going to be much more efficient.
Unfortunately, not many GPU vendors provide libraries like CUDA, so most ML is done on nVidia GPUs.
Then there is the next level, when you're using a TPU instead of a GPU - then there is no WebGL to start with.
If I select WebGPU with the TFJS benchmark (https://tensorflow.github.io/tfjs/e2e/benchmarks/local-benchmark/index.html) it responds with "WebGPU is not supported. Please use Chrome Canary browser with flag "--enable-unsafe-webgpu" enabled...."
So when that's ready will it be competitive with CUDA? On my laptop it is about 15% faster than WebGL on that benchmark.

On an NVIDIA GPU with multiple graphics cards (K80 for example), why does torch.cuda.device_count() return 1?

I ran the following code on a Tesla K80, which as I understand it consists of 2 GK210 graphics cards, each with 12GB of on-chip RAM, connected by something called a PLX switch. I am confused how, at the PyTorch level, the fact that there are two graphics cards is hidden from the user.
import torch
torch.cuda.device_count() # 1
(my hunch is that tensorflow provides this same abstraction)
Follow up questions:
If I am training a model with PyTorch, and I run nvidia-smi and see that the GPU is fully utilized, I would assume this means that both GK210s are at 100% utilization. How does PyTorch distribute kernels across the two GK210s, and can I have faith that this is being done efficiently (i.e. in a way that minimizes data transfer between the two cards)? Any resources that explain how this works would be much appreciated.
If I were writing a CUDA application, could I pin a CUDA stream to each card, and explicitly manage data transfers between the two cards?
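As a rough sketch of that second follow-up (written in PyTorch terms rather than raw CUDA, and assuming both GK210s are actually exposed as separate CUDA devices, i.e. not hidden by CUDA_VISIBLE_DEVICES), you can place tensors and streams on each card explicitly and manage transfers yourself; the sizes and device indices below are purely illustrative:

import torch

# Illustrative sketch only: explicit two-device work, assuming device_count() >= 2.
if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device('cuda:0'), torch.device('cuda:1')
    s0 = torch.cuda.Stream(device=dev0)   # one stream per GK210
    s1 = torch.cuda.Stream(device=dev1)

    a = torch.randn(1024, 1024, device=dev0)
    b = torch.randn(1024, 1024, device=dev1)

    # Each matmul runs on the device its inputs live on, queued on that device's stream.
    with torch.cuda.device(dev0), torch.cuda.stream(s0):
        a = a @ a
    with torch.cuda.device(dev1), torch.cuda.stream(s1):
        b = b @ b

    # An explicitly managed transfer between the two cards (across the PLX switch on a K80).
    b_on_0 = b.to(dev0)
    torch.cuda.synchronize()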

By default, does TensorFlow use GPU/CPU simultaneously for computing or GPU only?

By default, TensorFlow will use the available GPU devices. That said, does TensorFlow use GPUs and CPUs simultaneously for computing, or GPUs for computing and CPUs for job handling? (Either way, the CPUs are always active, I think.)
Generally it uses both the CPU and the GPU (assuming you are using a GPU-enabled TensorFlow). What actually gets used depends on the actual operations that your code is using.
For each operation available in TensorFlow, there are several "implementations" of that operation, generally a CPU implementation and a GPU one. Some operations only have CPU implementations because a GPU implementation would make no sense, but overall most operations are available for both devices.
If you write custom operations, then you need to provide the implementations you want.
TensorFlow operations come packaged with a list of devices they can execute on and a list of associated priorities.
For example, a convolution is very conducive to computation on a GPU, but can still be done on a CPU, whereas scalar additions should definitely be done on a CPU. You can override this selection using tf.device and the key attached to the device of interest.
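As a minimal sketch of that override (not from the original answer; it assumes the TF 1.x graph/session API this question is about), you can pin ops with tf.device and confirm the placement with log_device_placement:

import tensorflow as tf

# Pin these ops to the CPU even if a GPU device is available; the values are arbitrary.
with tf.device('/cpu:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    c = tf.matmul(a, b)

# log_device_placement makes TensorFlow print which device each op was assigned to.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))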
Someone correct me if I'm wrong.
But from what I'm aware, TensorFlow only uses either the GPU or the CPU depending on which installation you ran. For example, if you used pip install tensorflow for Python 2 or python3 -m pip install tensorflow for Python 3, you'll only get the CPU version.
Vice versa for the GPU version.
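If in doubt about which build you ended up with, one way to check (TF 1.x API, shown only as an illustration) is to list the devices TensorFlow can actually see:

from tensorflow.python.client import device_lib

# A CPU-only install typically reports just '/device:CPU:0';
# a GPU build on a CUDA machine also lists '/device:GPU:0', etc.
print([d.name for d in device_lib.list_local_devices()])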
If you still have any questions or if this did not correctly answer your question feel free to ask me more.

Strategies for improving performance when using TensorFlow w/ C++?

I'm fairly new to TensorFlow and ML in general, and am wondering what strategies I can use to increase the performance of an application I am building.
My app is using the TensorFlow C++ interface, with a source-compiled TF 0.11 libtensorflow_cc.so (built with bazel build -c opt --copt=-mavx and optionally adding --config=cuda) for either AVX or AVX + CUDA, on Mac OS X 10.12.1, on a MacBook Pro 2.8 GHz Intel Core i7 (2 cores, 8 threads) with 16GB RAM and an Nvidia 750m with 2GB VRAM.
My application is using the Inception V3 model and pulling feature vectors from the pool_3 layer. I'm decoding video frames via native APIs and passing them in memory buffers to the C++ interface for TF and running them through a session.
I'm not currently batching, but I am caching my session and re-using it for each individual decoded frame / tensor submission. I've noticed that CPU and GPU performance is about the same, taking about 40 to 50 seconds to process 222 frames, which seems very slow to me. I've confirmed CUDA is being invoked and loaded, and that the GPU is functioning (or appears so).
Some questions:
In general, what should I expect, time-wise, as reasonable performance for TF running a frame through Inception on a consumer laptop?
How much of a difference does batching make for these operations? For tensors of 1x299x299x3, I imagine I am spending more time waiting on PCI transfers than waiting for meaningful work from the GPU?
If so, is there a good example of batching under C++ for Inception V3?
Are there operations that cause additional CPU->GPU synchronization that might otherwise be avoided?
Is there a way to ensure my sessions / graphs share resources? Can I use nested scopes somehow in this manner? I couldn't quite get that to work, but likely missed something.
Any good documentation of general strategies for things to do / avoid?
My code is below:
https://github.com/Synopsis/Synopsis/blob/TensorFlow/Synopsis/TensorFlowAnalyzer/TensorFlowAnalyzer.mm
Thank you very much
For reference, OpenCV analysis using perceptual hash, histogram, dense optical flow, sparse optical flow for point tracking, and simple saliency detection takes 4 to 5 seconds for the same 222 frames using CPU or CPU + OpenCL.
https://github.com/Synopsis/Synopsis/tree/TensorFlow/Synopsis/StandardAnalyzer
Answering your last question first, if there's documentation about performance optimization, yes:
The TensorFlow Performance Guide
The TensorFlow GPU profiling hints
Laptop performance is highly variable, and TF isn't particularly optimized for laptop GPUs. The numbers you're getting (222 frames in 40-50 seconds, roughly 5 fps) don't seem crazy on a laptop platform, using the 2016 version of TensorFlow, with Inception. With some of the performance improvements outlined in the performance guide above, that should probably be doubled in late 2017.
For batching, yes - the newer example Inception model code allows a variable batch size at inference time. This is mostly about whether the model itself was defined to handle a batch size, which is something that has improved since 2016.
Batching for inference will make a pretty big difference on GPU. Whether it helps on CPU depends a lot -- for example, if you build with MKL-DNN support, batching should be considered mandatory, but basic TensorFlow may not benefit as much.
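As a rough illustration of why batching helps (Python rather than the C++ API, and with a trivial stand-in op instead of the real Inception pool_3, so everything here is hypothetical): feeding an [N, 299, 299, 3] batch through a single session.run amortizes the per-call launch and PCIe-transfer overhead that per-frame calls pay 222 times over:

import numpy as np
import tensorflow as tf

images = tf.placeholder(tf.float32, [None, 299, 299, 3])   # variable batch dimension
features = tf.reduce_mean(images, axis=[1, 2])              # cheap stand-in for pool_3

frames = np.random.rand(222, 299, 299, 3).astype(np.float32)  # stand-in for decoded frames

with tf.Session() as sess:
    # Per-frame: 222 separate run calls, each paying launch/transfer overhead.
    per_frame = [sess.run(features, {images: f[None]}) for f in frames]
    # Batched: seven calls of up to 32 frames each amortize that overhead.
    batched = [sess.run(features, {images: frames[i:i + 32]})
               for i in range(0, len(frames), 32)]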

Configuring TensorFlow to use all CPUs

Reading https://www.tensorflow.org/versions/r0.10/resources/faq.html, it states:
Does TensorFlow make use of all the devices (GPUs and CPUs) available on my machine?
TensorFlow supports multiple GPUs and CPUs. See the how-to documentation on using GPUs with TensorFlow for details of how TensorFlow assigns operations to devices, and the CIFAR-10 tutorial for an example model that uses multiple GPUs.
Note that TensorFlow only uses GPU devices with a compute capability greater than 3.5.
Does this mean TensorFlow can automatically make use of all the CPUs on a given machine, or does it need to be explicitly configured?
CPUs are used via a "device" which is just a threadpool. You can control the number of threads if you feel like you need more:
import tensorflow as tf
NUM_THREADS = 8  # e.g. the number of physical cores
sess = tf.Session(config=tf.ConfigProto(
    intra_op_parallelism_threads=NUM_THREADS))