Is there any way to fuse a fully connected layer (GEMM) and an activation layer (ReLU/sigmoid) on the GPU in a DNN? - tensorflow

Usually one layer in a DNN consists of MatMul, BiasAdd, and Relu. cuBLAS provides GEMM for the MatMul, and BiasAdd and Relu can run in a second GPU kernel. That makes two GPU launch calls per layer. Is there any way to fuse them all together into a single launch? I looked into cuBLAS and cuDNN but did not find anything. I think it should not be difficult, because BiasAdd and Relu are just element-wise operations, and fusing them would be more efficient.
Here is the background:
I am working on an online prediction service that ensembles multiple DNN models. Profiling my program shows that neither the CPU nor the GPU is fully utilized, yet requests block on GPU-related function calls (like launchKernel). It seems like there is a big lock in libcuda. I am using TensorFlow with XLA enabled, so I used nvprof and the TensorFlow HLO output to visualize the GPU calls, and there are only dot and fused (which is BiasAdd plus Relu) operations. Although kernel fusion is done, there are still too many launchKernel calls, and GPU utilization is only 60%. I tried multiple CUDA contexts in one process, but the improvement was negligible.
By the way, I am using one single GPU, Tesla P100.
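For reference, the fusion being asked about can be sketched on the CPU in plain Python. This is only an illustration of the arithmetic, not a GPU implementation: the function names are hypothetical, and each full pass over the output in the unfused version stands in for one kernel launch.

```python
def gemm_then_epilogue(x, w, b):
    """Unfused: three separate passes, each standing in for one kernel launch."""
    rows, inner, cols = len(x), len(w), len(w[0])
    # Pass 1: MatMul writes the full intermediate matrix.
    y = [[sum(x[i][k] * w[k][j] for k in range(inner)) for j in range(cols)]
         for i in range(rows)]
    # Pass 2: BiasAdd reads and rewrites every element.
    y = [[y[i][j] + b[j] for j in range(cols)] for i in range(rows)]
    # Pass 3: Relu reads and rewrites every element again.
    return [[max(v, 0.0) for v in row] for row in y]


def fused_gemm_bias_relu(x, w, b):
    """Fused: bias and Relu are applied to each accumulator before it is stored."""
    rows, inner, cols = len(x), len(w), len(w[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = b[j]  # start from the bias instead of adding it in a later pass
            for k in range(inner):
                acc += x[i][k] * w[k][j]
            out[i][j] = max(acc, 0.0)  # Relu on the accumulator, single store
    return out


# Small illustrative inputs (values chosen so the arithmetic is exact).
x = [[1.0, -2.0], [0.5, 3.0]]
w = [[2.0, 0.0, -1.0], [1.0, 1.0, 0.5]]
b = [0.5, -4.0, 1.0]

unfused = gemm_then_epilogue(x, w, b)
fused = fused_gemm_bias_relu(x, w, b)
```

Both functions produce the same matrix. The fused version never materializes the intermediate MatMul result, which is the saving a fused GPU kernel would be after.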

Related

Assign Keras/TF/PyTorch layer to hardware type

Suppose we have the following architecture:
Multiple CNN layers
RNN layer
(Time-distributed) Dense classification layer
We want to train this architecture now. Our fancy GPU is very fast at the CNN layers: although it runs at a lower clock rate, it can perform many convolutions in parallel, hence the speed. Our fancy CPU, however, is faster for the (very long) resulting time series, because the time steps cannot be parallelized and the processing profits from the higher CPU clock rate. So the (supposedly) smart idea for execution would look like this:
Multiple CNN layers (run on GPU)
RNN layer (run on CPU)
(Time-distributed) Dense classification layer (run on GPU/CPU)
This leads me to two important questions:
Is it possible, with any of the frameworks mentioned in the title, to distribute certain layers to certain hardware, and how?
If it is possible, would the overhead for the additional memory operations, e.g. transferring between GPU and CPU RAM, render the whole idea useless?
Basically, in PyTorch you can control the device on which variables/parameters reside. AFAIK, it is your responsibility to make sure that all the arguments of each operation reside on the same device: i.e., you cannot call conv(x, y) where x is on the GPU and y is on the CPU.
This is done via PyTorch's .to() method, which moves a module/variable with .to('cpu') or .to('cuda:0').
As Shai mentioned, you can control this yourself in PyTorch, so in theory you can have parts of your model on different devices. You then have to move data between devices in your forward pass.
I think the overhead you mentioned would make the performance worse. The CUDA RNN implementation benefits greatly from running on a GPU anyway :)
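A minimal sketch of what the answers above describe, assuming PyTorch is installed. The class name HybridNet and all layer sizes are hypothetical, and the code falls back to the CPU for both halves when no CUDA device is available.

```python
import torch
import torch.nn as nn


class HybridNet(nn.Module):
    """Conv front end on one device, RNN and classifier on another (hypothetical sizes)."""

    def __init__(self, conv_device, rnn_device):
        super().__init__()
        self.conv_device = torch.device(conv_device)
        self.rnn_device = torch.device(rnn_device)
        # Each submodule is moved to its own device with .to().
        self.conv = nn.Conv1d(4, 8, kernel_size=3, padding=1).to(self.conv_device)
        self.rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True).to(self.rnn_device)
        self.head = nn.Linear(16, 5).to(self.rnn_device)

    def forward(self, x):
        # x has shape (batch, channels, time).
        x = x.to(self.conv_device)
        features = torch.relu(self.conv(x))      # (batch, 8, time)
        features = features.transpose(1, 2)      # (batch, time, 8)
        features = features.to(self.rnn_device)  # explicit hop between devices
        out, _ = self.rnn(features)              # (batch, time, 16)
        return self.head(out)                    # (batch, time, 5)


# Use the GPU for the conv part if one is present, otherwise stay on the CPU.
gpu = "cuda:0" if torch.cuda.is_available() else "cpu"
model = HybridNet(conv_device=gpu, rnn_device="cpu")
logits = model(torch.randn(2, 4, 50))
```

Every forward pass pays for the features.to(...) copy, which is the transfer overhead the question asks about. Calling model.to(...) on the whole module afterwards would move all submodules to one device and undo the split.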

What is a fused kernel (or fused layer) in deep learning?

I am reading the Apex AMP documentation:
A Python-only build omits:
Fused kernels required to use apex.optimizers.FusedAdam.
Fused kernels required to use apex.normalization.FusedLayerNorm.
Fused kernels that improve the performance and numerical stability of apex.parallel.SyncBatchNorm.
Fused kernels that improve the performance of apex.parallel.DistributedDataParallel and apex.amp.
DistributedDataParallel, amp, and SyncBatchNorm will still be usable, but they may be slower.
There also seems to be a "FusedAdam" optimizer:
The Adam optimizer in Pytorch (like all Pytorch optimizers) carries
out optimizer.step() by looping over parameters, and launching a
series of kernels for each parameter. This can require hundreds of
small launches that are mostly bound by CPU-side Python looping and
kernel launch overhead, resulting in poor device utilization.
Currently, the FusedAdam implementation in Apex flattens the
parameters for the optimization step, then carries out the
optimization step itself via a fused kernel that combines all the Adam
operations. In this way, the loop over parameters as well as the
internal series of Adam operations for each parameter are fused such
that optimizer.step() requires only a few kernel launches.
The current implementation (in Apex master) is brittle and only works
with Amp opt_level O2. I’ve got a WIP branch to make it work for any
opt_level (https://github.com/NVIDIA/apex/pull/351). I recommend
waiting until this is merged then trying it.
This partially explains it. I'm left with more questions:
What is meant by kernel? A layer or an optimizer?
Is the idea of fused layer the same as a fused optimizer?
"Kernel" here is for computation kernels: https://en.wikipedia.org/wiki/Compute_kernel
Operations like convolution are often implemented as compute kernels for better efficiency. Compute kernels can be written in C, CUDA, OpenCL, or even assembly for maximum efficiency. It is therefore not surprising that "a Python-only build" does not support...
"Fusing" means merging computation steps. Basically, it's an implementation trick to run code more efficiently by combining similar operations into a single hardware (GPU, CPU, or TPU) operation. Therefore, a "fusedLayer" is a layer whose operations benefit from a "fused" implementation.
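To make the FusedAdam description quoted in the question more concrete, here is a toy model in plain Python. It is not the Apex implementation: the function names are hypothetical, each element-wise pass over a tensor stands in for one kernel launch, and counting three launches per parameter is a simplification.

```python
import math

LR, B1, B2, EPS = 1e-3, 0.9, 0.999, 1e-8  # commonly used Adam defaults


def adam_per_parameter(params, grads, ms, vs, step):
    """Unfused stand-in: three element-wise passes ("launches") per tensor."""
    launches = 0
    for p, g, m, v in zip(params, grads, ms, vs):
        m[:] = [B1 * mi + (1 - B1) * gi for mi, gi in zip(m, g)]
        v[:] = [B2 * vi + (1 - B2) * gi * gi for vi, gi in zip(v, g)]
        p[:] = [pi - LR * (mi / (1 - B1 ** step)) / (math.sqrt(vi / (1 - B2 ** step)) + EPS)
                for pi, mi, vi in zip(p, m, v)]
        launches += 3
    return launches


def adam_fused(params, grads, ms, vs, step):
    """Fused stand-in: index all elements of all tensors, then update in one pass."""
    flat = [(t, i) for t in range(len(params)) for i in range(len(params[t]))]
    for t, i in flat:  # models a single kernel that covers every element
        g = grads[t][i]
        m = B1 * ms[t][i] + (1 - B1) * g
        v = B2 * vs[t][i] + (1 - B2) * g * g
        params[t][i] -= LR * (m / (1 - B1 ** step)) / (math.sqrt(v / (1 - B2 ** step)) + EPS)
        ms[t][i], vs[t][i] = m, v
    return 1


def fresh_state():
    """Three small hypothetical parameter tensors with gradients and zeroed moments."""
    params = [[0.5, -1.0], [2.0], [0.1, 0.2, 0.3]]
    grads = [[0.1, -0.2], [0.3], [-0.4, 0.5, 0.0]]
    ms = [[0.0] * len(p) for p in params]
    vs = [[0.0] * len(p) for p in params]
    return params, grads, ms, vs


state_a, state_b = fresh_state(), fresh_state()
launches_unfused = adam_per_parameter(*state_a, step=1)
launches_fused = adam_fused(*state_b, step=1)
```

Both versions apply the same update to every element. The difference is only in how the work is grouped: the number of launches in the unfused version grows with the number of parameter tensors, while the fused version stays at one.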

By default, does TensorFlow use GPU/CPU simultaneously for computing or GPU only?

By default, TensorFlow will use our available GPU devices. That said, does TensorFlow use GPUs and CPUs simultaneously for computing, or GPUs for computing and CPUs only for job handling (CPUs are always active either way, I assume)?
Generally it uses both the CPU and the GPU (assuming you are using a GPU-enabled TensorFlow). What actually gets used depends on the operations that your code is using.
For each operation available in TensorFlow, there are several "implementations" of that operation, generally a CPU implementation and a GPU one. Some operations have only a CPU implementation because a GPU implementation would make no sense, but overall most operations are available for both devices.
If you write custom operations, you need to provide the implementations you want.
TensorFlow operations come packaged with a list of devices they can execute on and a list of associated priorities.
For example, a convolution is very well suited to computation on a GPU but can still be done on a CPU, whereas scalar additions should definitely be done on a CPU. You can override this selection using tf.device and the name of the device of interest.
Someone correct me if I'm wrong.
But as far as I'm aware, TensorFlow uses only the GPU or only the CPU, depending on which installation you ran. For example, if you used pip install tensorflow for Python 2 or python3 -m pip install tensorflow for Python 3, you'll only use the CPU version.
Vice versa for GPU.
If you still have any questions or if this did not correctly answer your question feel free to ask me more.

CUDA-like optimization on Tensorflow-GPU

I am trying to implement a neural network architecture (Self Organizing Maps) for execution on GPUs. I am exploring TensorFlow for this task.
In TensorFlow, I noticed that you just have to specify the GPU as the device to execute something on it, as in this post. It seems that TF decides how the operations are parallelized and the user has no way to make optimization decisions. The "Optimizing for GPU" section of the TensorFlow Performance Guide also does not talk about explicit control over parallelizing operations.
My question is, can I do CUDA-like optimization in TensorFlow? More elaborately, is it possible to define which operation will be parallelized (like defining CUDA kernels for parallel operations)?
Yes, but you probably don't want to.
At the most extreme you can define your own op (as described here: https://www.tensorflow.org/extend/adding_an_op).
You can implement it as a GPU Kernel and write whatever you want.
You probably don't want to. The default operations are likely well optimized. I doubt you would be able to squeeze anything significant out of them.
You can decide the device placement for each individual operation (by using tf.device), but you will incur data transfer overhead every time you switch. This should cover the cases where some operation is slow to execute on the GPU.
If you want to process part of the data on CPU and part on the GPU you can slice your data and do 2 operations (one on CPU and one on GPU).
By default in TF, in graph mode (not in eager mode), all TF ops that do not depend on each other can run in parallel. There is a thread pool for that, and its size is controlled via inter_op_parallelism_threads. (See also.)
That does not necessarily mean that e.g. multiple matmuls will really run in parallel if they are internally synchronized. That is the case for most CUDA ops, as there is only a single CUDA stream. See here.
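The effect described above can be imitated with the Python standard library. This is a toy model, not TensorFlow internals: the thread pool plays the role of the inter-op pool, and a lock (a hypothetical stand-in) plays the role of the single CUDA stream.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

stream_lock = threading.Lock()  # hypothetical stand-in for the single CUDA stream
stats = {"in_flight": 0, "max_in_flight": 0}


def run_op(name):
    """Dispatched from a pool thread, but the 'device work' must hold the stream."""
    with stream_lock:
        stats["in_flight"] += 1
        stats["max_in_flight"] = max(stats["max_in_flight"], stats["in_flight"])
        time.sleep(0.01)  # pretend kernel execution
        stats["in_flight"] -= 1
    return name


ops = ["matmul_a", "matmul_b", "matmul_c", "matmul_d"]
# Four worker threads dispatch the ops concurrently, like a pool of size 4.
with ThreadPoolExecutor(max_workers=4) as pool:
    finished = list(pool.map(run_op, ops))
```

Even with four worker threads, max_in_flight never exceeds 1: the ops are dispatched in parallel but execute one at a time, so a larger thread pool alone does not add device-level parallelism.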

What's the impact of using a GPU in the performance of serving a TensorFlow model?

I trained a neural network using a GPU (1080 Ti). Training on the GPU is far faster than on the CPU.
Currently, I want to serve this model using TensorFlow Serving. I am just interested to know whether using a GPU in the serving process has the same impact on performance.
Since training operates on batches but inference (serving) handles asynchronous requests, do you suggest using a GPU when serving a model with TensorFlow Serving?
You still need to do a lot of tensor operations on the graph to predict something, so a GPU still provides a performance improvement for inference. Take a look at this NVIDIA paper; they have not tested their stuff on TF, but it is still relevant:
Our results show that GPUs provide state-of-the-art inference
performance and energy efficiency, making them the platform of choice
for anyone wanting to deploy a trained neural network in the field. In
particular, the Titan X delivers between 5.3 and 6.7 times higher
performance than the 16-core Xeon E5 CPU while achieving 3.6 to 4.4
times higher energy efficiency.
The short answer is yes, you'll get roughly the same speedup from running on the GPU after training, with a few minor qualifications.
In training you run two passes over the data (forward and backward), all on the GPU. During feedforward inference you do less work, so relatively more time is spent transferring data to GPU memory than in training. This is probably a minor difference though, and you can now load the GPU asynchronously if that's an issue (https://github.com/tensorflow/tensorflow/issues/7679).
Whether you actually need a GPU for inference depends on your workload. If it isn't overly demanding you might get away with using the CPU; after all, the computation per sample is less than half that of training. So consider the number of requests per second you'll need to serve and test whether you overload your CPU at that rate. If you do, time to get the GPU out!