By default, does TensorFlow use GPU/CPU simultaneously for computing or GPU only? - tensorflow

By default, TensorFlow will use our available GPU devices. That said, does TensorFlow use GPUs and CPUs simultaneously for computing, or GPUs for computing and CPUs for job handling (no matter how, CPUs are always active, as I think)?

Generally it uses both, the CPU and the GPU (assuming you are using a GPU-enabled TensorFlow). What actually gets used depends on the actual operations that your code is using.
For each operation available in TensorFlow, there are several "implementations" of such operation, generally a CPU implementation and a GPU one. Some operations only have CPU implementations as it makes no sense for a GPU implementation, but overall most operations are available for both devices.
If you make custom operations then you need to provide implementations that you want.

TensorFlow operations come packaged with a list of devices they can execute on and a list of associated priorities.
For example, a convolution is very conducive to computation on a GPU; but can still be done on a CPU whereas scalar additions should definitely be done on a CPU. You can override this selection using tf.Device and the key attached to the device of interest.

Someone correct me if I'm wrong.
But from what I'm aware TensorFlow only uses either GPU or CPU depending on what installation you ran. For example if you used pip install TensorFlow for python 2 or python3 -m pip install TensorFlow for python 3 you'll only use the CPU version.
Vise versa for GPU.
If you still have any questions or if this did not correctly answer your question feel free to ask me more.

Related

Since TensorflowJS can use the GPU via WebGL, why would I need an nVIDIA GPU?

So TensorFlowJS can use WebGL to do GPU computations and train deep learning models. Why isn't this more popular than using CUDA with an nVIDIA GPU? Most people just trying to prototype machine learning models would love to do so on their personal computer, but many of us resort to using expensive cloud services like AWS (although more recently Google Colab helps) for ML training if we don't have a computer with an nVIDIA GPU. I'm sure nVIDIA GPUs are faster than whatever GPU is in my Macbook, but probably any GPU will offer at least an order of magnitude speedup over even a fast CPU and allow for model prototyping, so why aren't well using WebGL GPGPU? There must be a catch I just don't know about.
WebGL backend uses GLSL language to define functions and upload data as shaders - it "works", but you pay huge cost to compile GSLS and upload shaders: warmup time for semi-complex models is immense (we're talking about minutes just to startup). And then memory overhead is 100-200% of what model would normally need - and for larger models, you're GPU memory bound, you don't want to waste that.
Btw, actual inference time once model is warmed up and it fits in memory is ok using WebGL
On the other hand nVidia CUDA libraries provide direct access to GPU, so TF compiled to use them is always going to be much more efficient.
Unfortunately, not many GPU vendors provide libraries like CUDA, so most ML is done on nVidia GPUs
Then there is a next level when you're using TPU instead of GPU - then there is no WebGL to start with
If I select WebGPU with the TFJS benchmark (https://tensorflow.github.io/tfjs/e2e/benchmarks/local-benchmark/index.html) it responds with "WebGPU is not supported. Please use Chrome Canary browser with flag "--enable-unsafe-webgpu" enabled...."
So when that's ready will it be competitive with CUDA? On my laptop it is about 15% faster than WebGL on that benchmark.

CUDA programming: Is occupancy the way to achieve GPU slicing among different process?

There are ways through which GPU sharing can be achieved. I came across occupancy. Can I use it to slice the GPU among the processes (e.g. tensorflow) which are sharing GPU? slice here means GPU resources are always dedicated to that process. Using occupancy I will get know GPU & SMs details and based on that I launch kernel stating that create blocks for these GPU resources.
I am using NVIDIA Corporation GK210GL [Tesla K80] with cuda 9 toolkit installed
Please suggest. Thanks!
There are ways through which GPU sharing can be achieved.
No there aren't. In general, there is no such thing as the type of GPU sharing that you envisage. There is the MPS server for MPI style multi process computing, but that is irrelevant in the context of running Tensorflow (see here for why MPS can't be used).
I came across occupancy. Can I use it to slice the GPU among the processes (e.g. tensorflow) which are sharing GPU?
No you can't. Occupancy is a performance metric. It has nothing whatsoever to do with the ability to share a GPUs resources amongst different processes,
Please suggest
Buy a second GPU.

CUDA-like optimization on Tensorflow-GPU

I am trying to implement a neural network architecture (Self Organizing Maps) for execution on GPUs. I am exploring TensorFlow for this task.
In TensorFlow, I noticed that you just have to specify gpu as the device to execute something on the gpu like in this post. It seems that the way the operations are parallelized is decided by TF and the user does not have options to take optimization decisions. The "Optimizing for GPU" section on TensorFlow Performance Guide also does not talk about explicit control over parallelizing operations.
My question is, can I do CUDA-like optimization in TensorFlow? More elaborately, is it possible to define which operation will be parallelized (like defining CUDA kernels for parallel operations)?
Yes, but you probably don't want to.
At the most extreme you can define your own op (as described here: https://www.tensorflow.org/extend/adding_an_op).
You can implement it as a GPU Kernel and write whatever you want.
You probably don't want to. The default operations are likely well optimized. I doubt you would be able to squeeze anything out significant out of them.
You can decide the device placement for each individual operation (by using tf.device), but you will incur data transfer overhead every time you switch. This should cover the cases where there's some operation that it slow to execute on the GPU.
If you want to process part of the data on CPU and part on the GPU you can slice your data and do 2 operations (one on CPU and one on GPU).
By default, in TF, in graph mode (not in eager mode), everything, all the TF ops run in parallel. There is a thread pool for that, and its size is controlled via inter_op_parallelism_threads. (See also.)
That does not necessarily mean that e.g. multiple matmul will really run in parallel, if they are internally synchronized. That is the case for most CUDA ops, as there is only a single CUDA stream. See here.

Does tensorflow automatically detect GPU or do I have to specify it manually?

I have a code written in tensorflow that I run on CPUs and it runs fine.
I am transferring to a new machine which has GPUs and I run the code on the new machine but the training speed did not improve as expected (takes almost the same time).
I understood that Tensorflow automatically detects GPUs and run the operations on them (https://www.quora.com/How-do-I-automatically-put-all-my-computation-in-a-GPU-in-TensorFlow) & (https://www.tensorflow.org/tutorials/using_gpu).
Do I have to change the code to make it manually runs the operations on GPUs (for now I have a single GPU)? and what would be gained by doing that manually?
Thanks
If the GPU version of TensorFlow is installed and if you don't assign all your tensors to CPU, some of them should be assigned to GPU.
To find out which devices (CPU, GPU) are available to TensorFlow, you can use this:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
Regarding the question of the performance, it's quite a broad subject and it really depends of your model, your data and so on. Here are a few and wide remarks on TensorFlow performance.

Strategies for improving performance when using Tensorflow w / C++?

I'm fairly new to Tensorflow in and ML in general and am wondering what strategies I can use to increase performance of an application I am building.
My app is using the Tensorflow C++ interface, with a source compiled TF 0.11 libtensorflow_cc.so (built with bazel build -c opt --copt=-mavx and optionally adding --config=cuda) for either AVX or AVX + CUDA on Mac OS X 10.12.1, on an MacBook Pro 2.8 GHz Intel Core i7 (2 cores 8 threads) with 16GB ram and a Nvidia 750m w/ 2GB VRam)
My application is using Inception V3 model and pulling feature vectors from pool_3 layer. I'm decoding video frames via native API's and passing those in memory buffers to the C++ interface for TF and running them into a session.
I'm not currently batching, but I am caching my session and re-using it for each individual decoded frame / tensor submission. Ive noticed that both CPU and GPU performance is about the same, taking about 40 to 50 seconds to process 222 frames, which seems very slow to me. Ive confirmed CUDA is being invoked, loaded, and the GPU is functioning (or appears so).
Some questions:
In general what should I expect for reasonable performance time wise of TF doing a frame of Inception on a consumer laptop?
How much of a difference does batching make for these operations? For tensors of 1x299x299x3 , I imagine I am doing more PCI transfer waiting than waiting on for meaningful work from the GPU?
if so Is there a good example of batching under C++ for InceptionV3?
Is there operations that cause additional CPU->GPU Syncronization that might otherwise be avoided?
Is there a way to ensure my sessions / graphs share resources ? Can I use nested scopes somehow in this manner? I couldn't quite get that to work but likely missed something.
Any good documentation of general strategies for things to do / avoid?
My code is below:
https://github.com/Synopsis/Synopsis/blob/TensorFlow/Synopsis/TensorFlowAnalyzer/TensorFlowAnalyzer.mm
Thank you very much
For reference, OpenCV analysis using perceptual hash, histogram, dense optical flow, sparse optical flow for point tracking, and simple saliency detection takes 4 to 5 seconds for the same 222 frames using CPU or CPU + OpenCL.
https://github.com/Synopsis/Synopsis/tree/TensorFlow/Synopsis/StandardAnalyzer
Answering your last question first, if there's documentation about performance optimization, yes:
The TensorFlow Performance Guide
The TensorFlow GPU profiling hints
Laptop performance is highly variable, and TF isn't particularly optimized for laptop GPUs. The numbers you're getting (222 frames in 40-50 seconds) ~= 5 fps don't seem crazy on a laptop platform, using the 2016 version of TensorFlow, with inception. With some of the performance improvements outlined in the performance guide above, that should probably be doubled in late 2017.
For batching, yes - the newer example inception model code allows a variable batch size at inference time. This is mostly about whether the model itself was defined to handle a batch size, which is something improved since 2016.
Batching for inference will make a pretty big difference on GPU. Whether it helps on CPU depends a lot -- for example, if you build with MKL-DNN support, batching should be considered mandatory, but basic TensorFlow may not benefit as much.