What are the fastest available implementations of BLAS/LAPACK or other linear algebra routines on GPU systems?

Nvidia, for example, has CUBLAS, which promises a 7-14x speedup. Naively, that is nowhere near the theoretical throughput of any of Nvidia's GPU cards. What are the challenges in speeding up linear algebra on GPUs, and are there faster linear algebra routines already available?

As far as I know, CUBLAS is the fastest linear algebra implementation available for Nvidia GPUs. If you require LAPACK functionality, there's CULA.
Note that CUBLAS only covers dense linear algebra; for sparse matrices, there's CUSPARSE (also provided as part of the CUDA toolkit).
The speedup greatly depends on the type of data you're operating on, as well as the specific operation you're performing. Some linear algebra operations parallelize very well, and others don't because they're inherently sequential. Optimization of numerical algorithms for parallel architectures is (and has been, for decades) an ongoing area of research -- so the performance of the algorithms is continually improving.
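As a rough illustration of measuring this for your own problem sizes (CuPy is not mentioned in the question or answers above; it is a NumPy-like Python library whose dense routines dispatch to cuBLAS and whose sparse routines dispatch to cuSPARSE), a minimal benchmark sketch:

```python
# Minimal sketch, assuming CuPy is installed and a CUDA-capable GPU is present.
# CuPy calls cuBLAS for dense matrix products, so this gives a quick sense of
# the real speedup for a given matrix size and operation.
import time

import numpy as np
import cupy as cp

n = 4096
a_cpu = np.random.rand(n, n).astype(np.float32)
b_cpu = np.random.rand(n, n).astype(np.float32)

# CPU (whatever BLAS backs NumPy on this machine)
t0 = time.perf_counter()
c_cpu = a_cpu @ b_cpu
cpu_time = time.perf_counter() - t0

# GPU (cuBLAS via CuPy)
a_gpu = cp.asarray(a_cpu)
b_gpu = cp.asarray(b_cpu)
cp.cuda.Device().synchronize()      # make sure host-to-device transfers are done
t0 = time.perf_counter()
c_gpu = a_gpu @ b_gpu
cp.cuda.Device().synchronize()      # wait for the kernel to finish before timing
gpu_time = time.perf_counter() - t0

print(f"CPU: {cpu_time:.3f}s  GPU: {gpu_time:.3f}s  speedup: {cpu_time / gpu_time:.1f}x")
```

Note that the timing above deliberately excludes the host-to-device transfer; for small matrices the transfer can dominate and erase the speedup, which is one reason measured gains fall short of theoretical throughput.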

Related

What is the difference between JAX, Trax, and TensorRT, in simple terms?

I have been using TensorRT and TensorFlow-TRT to accelerate the inference of my DL algorithms.
Then I have heard of:
JAX https://github.com/google/jax
Trax https://github.com/google/trax
Both seem to accelerate DL, but I am having a hard time understanding them. Can anyone explain them in simple terms?
Trax is a deep learning framework created by Google and extensively used by the Google Brain team. It is an alternative to TensorFlow and PyTorch for implementing off-the-shelf, state-of-the-art deep learning models (for example Transformers, BERT, etc.), mainly in the field of Natural Language Processing.
Trax is built upon TensorFlow and JAX. JAX is an enhanced and optimised version of NumPy. The important distinction between JAX and NumPy is that the former uses a library called XLA (Accelerated Linear Algebra), which allows your NumPy-style code to run on GPUs and TPUs rather than on the CPU as plain NumPy does, thus speeding up computation.
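A minimal sketch of what that looks like in practice (the function and array shapes below are made up for illustration): NumPy-style code written with jax.numpy is compiled through XLA by jax.jit and runs on a GPU or TPU when one is available, otherwise on the CPU:

```python
# Minimal sketch: NumPy-style code compiled with XLA via jax.jit.
# On a machine with a GPU/TPU backend the compiled function runs there;
# otherwise JAX falls back to the CPU.
import jax
import jax.numpy as jnp

@jax.jit                           # compile the whole function with XLA
def predict(w, b, x):
    return jnp.tanh(x @ w + b)

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
w = jax.random.normal(k1, (512, 256))
b = jnp.zeros(256)
x = jax.random.normal(k2, (1024, 512))

y = predict(w, b, x)               # first call compiles; later calls reuse the compiled kernel
print(y.shape, jax.devices())      # shows which backend (cpu/gpu/tpu) is in use
```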

What is a fused kernel (or fused layer) in deep learning?

I am reading the Apex AMP documentation:
A Python-only build omits:
Fused kernels required to use apex.optimizers.FusedAdam.
Fused kernels required to use apex.normalization.FusedLayerNorm.
Fused kernels that improve the performance and numerical stability of apex.parallel.SyncBatchNorm.
Fused kernels that improve the performance of apex.parallel.DistributedDataParallel and apex.amp.
DistributedDataParallel, amp, and SyncBatchNorm will still be usable, but they may be slower.
There also seems to be a "FusedAdam" optimizer:
The Adam optimizer in Pytorch (like all Pytorch optimizers) carries out optimizer.step() by looping over parameters, and launching a series of kernels for each parameter. This can require hundreds of small launches that are mostly bound by CPU-side Python looping and kernel launch overhead, resulting in poor device utilization. Currently, the FusedAdam implementation in Apex flattens the parameters for the optimization step, then carries out the optimization step itself via a fused kernel that combines all the Adam operations. In this way, the loop over parameters as well as the internal series of Adam operations for each parameter are fused such that optimizer.step() requires only a few kernel launches.
The current implementation (in Apex master) is brittle and only works with Amp opt_level O2. I’ve got a WIP branch to make it work for any opt_level (https://github.com/NVIDIA/apex/pull/351). I recommend waiting until this is merged then trying it.
This partially explains it. I'm left with more questions:
What is meant by kernel? A layer or an optimizer?
Is the idea of fused layer the same as a fused optimizer?
"Kernel" here is for computation kernels: https://en.wikipedia.org/wiki/Compute_kernel
Operations like convolution are often implemented using compute kernels for better efficiency. Compute kernels can be written using C, CUDA, OpenCL or even assembly for maximum efficiency. It is therefore not surprizing that "a Python-only build" does not support...
"Fusing" means commonalization of computation steps. Basically, it's an implementation trick to run code more efficiently by combining similar operations in a single hardware (GPU, CPU or TPU) operation. Therefore, a "fusedLayer" is a layer where operations benefit from a "fused" implementation.

Why is MobileNetV2 faster than MobileNetV1 only on mobile devices?

I am studying Google's brand-new MobileNetV2 architecture.
While studying, I read this statement in the TensorFlow model zoo on GitHub:
'For example Mobilenet V2 is faster on mobile devices than Mobilenet V1, but is slightly slower on desktop GPU.'
So, my question is: how could that be possible? I really want to know why.
From https://arxiv.org/abs/1903.08469v1 :
"However, MobileNet V2 uses depthwise separable convolutions which are not directly supported in GPU firmware (the cuDNN library). Therefore, MobileNet V2 tends to be slower than ResNet18 in most experimental setups. Note that the same issue disqualifies usage of the DenseNet architecture [12], since it requires efficient convolution over a non-contiguous tensor, which is still not supported in cuDNN."
From the published paper, MobileNetV2: Inverted Residuals and Linear Bottlenecks, under Section 5, Implementation Notes, 5.1 Memory efficient inference:
The inverted residual bottleneck layers allow a particularly memory efficient implementation which is very important for mobile applications. (and more in the paper)
According to the TensorFlow team, the model is optimized to be smaller in size and can also be used with TF Lite, which, as far as we know, is indeed intended for mobile use. It is slower on a desktop GPU, probably because V2 has more convolutional layers than V1, which would also explain why training takes more time to finish. For now, training and inference are usually not done on mobile devices because of their hunger for computational speed, which leads to power hunger as well.
Hope this answers the question.

Is everything in Tensorflow implemented as a NN?

For example, k-means clustering: is it implemented as a neural network algorithm?
No, why should it be? To better understand TensorFlow, take a look at the original paper; the abstract states:
TensorFlow [1] is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards.
Hence TensorFlow is a tool to express algorithms and to schedule them on hardware such as CPUs, GPUs, TPUs and friends. The fact that it is best known for running neural networks doesn't mean that even the simplest algorithms have to be implemented as one.
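As a small illustration of that point (this is not TensorFlow's own k-means implementation, just a sketch), a single k-means iteration can be written with plain tensor operations and no neural network at all:

```python
# Illustration: one k-means iteration written with plain TensorFlow ops.
# Nothing here is a neural network; TensorFlow just schedules the tensor math
# on whatever device is available.
import tensorflow as tf

def kmeans_step(points, centroids):
    # points: (n, d), centroids: (k, d)
    # Squared distance from every point to every centroid -> shape (n, k).
    dists = tf.reduce_sum(
        tf.square(tf.expand_dims(points, 1) - tf.expand_dims(centroids, 0)), axis=2
    )
    assignments = tf.argmin(dists, axis=1)                     # nearest centroid per point
    # Recompute each centroid as the mean of its assigned points
    # (an empty cluster simply gets a zero vector in this sketch).
    new_centroids = tf.math.unsorted_segment_mean(
        points, assignments, num_segments=tf.shape(centroids)[0]
    )
    return new_centroids, assignments

points = tf.random.normal((1000, 2))
centroids = tf.random.normal((3, 2))
for _ in range(10):
    centroids, assignments = kmeans_step(points, centroids)
print(centroids.numpy())
```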

What's the impact of using a GPU on the performance of serving a TensorFlow model?

I trained a neural network using a GPU (1080 Ti). The training speed on the GPU is far better than on the CPU.
Currently, I want to serve this model using TensorFlow Serving. I am just interested to know whether using a GPU in the serving process has the same impact on performance.
Since training works on batches but inference (serving) handles asynchronous requests, do you suggest using a GPU for serving a model with TensorFlow Serving?
You still need to do a lot of tensor operations on the graph to predict something, so a GPU still provides a performance improvement for inference. Take a look at this NVIDIA paper; they have not tested their work on TensorFlow, but it is still relevant:
Our results show that GPUs provide state-of-the-art inference performance and energy efficiency, making them the platform of choice for anyone wanting to deploy a trained neural network in the field. In particular, the Titan X delivers between 5.3 and 6.7 times higher performance than the 16-core Xeon E5 CPU while achieving 3.6 to 4.4 times higher energy efficiency.
The short answer is yes, you'll get roughly the same speedup for running on the GPU after training, with a few minor qualifications.
In training you're running two passes over the data (forward and backward), all of which happens on the GPU; during feed-forward inference you're doing less work, so relatively more time will be spent transferring data to GPU memory compared with computation than in training. This is probably a minor difference though, and you can now load the GPU asynchronously if that's an issue (https://github.com/tensorflow/tensorflow/issues/7679).
Whether you'll actually need a GPU for inference depends on your workload. If your workload isn't overly demanding, you might get away with using the CPU anyway; after all, the computational workload per sample is less than half that of training. So consider the number of requests per second you'll need to serve and test whether your CPU gets overloaded at that rate. If it does, time to get the GPU out!
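A rough way to test that last point (the model path and input shape below are placeholders, not from the question): measure single-request throughput with and without the GPU visible, then compare it against the request rate you need to serve.

```python
# Rough sketch for sizing serving hardware (the model path and input shape are
# placeholders). Run it once with the GPU visible and once with
# CUDA_VISIBLE_DEVICES="" to get the CPU-only number, then compare both against
# the requests per second you need to handle.
import time

import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("my_model")                  # placeholder path
sample = np.random.rand(1, 224, 224, 3).astype(np.float32)      # placeholder input shape

model(sample)                                                   # warm-up (builds the graph)
n = 200
start = time.perf_counter()
for _ in range(n):
    model(sample)                                               # one request at a time
elapsed = time.perf_counter() - start

print("devices:", tf.config.list_physical_devices())
print(f"~{n / elapsed:.1f} requests/second at batch size 1")
```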