Is GEMM or BLAS used in TensorFlow, Theano, PyTorch?

I know that Caffe uses GEneral Matrix-Matrix Multiplication (GEMM), which is part of the Basic Linear Algebra Subprograms (BLAS) library, for performing convolution operations: a convolution is converted into a matrix multiplication. I have referred to the article below. https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
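To make the convolution-as-GEMM idea from that article concrete, here is a minimal NumPy sketch of the im2col trick for a single-channel input; the shapes and names are illustrative only, not any framework's actual implementation:

    import numpy as np

    def im2col(x, kh, kw):
        # Unroll every kh x kw patch of x into one row of a matrix.
        H, W = x.shape
        out_h, out_w = H - kh + 1, W - kw + 1
        cols = np.empty((out_h * out_w, kh * kw))
        for i in range(out_h):
            for j in range(out_w):
                cols[i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
        return cols

    x = np.arange(16.0).reshape(4, 4)        # toy 4x4 input
    k = np.ones((3, 3))                      # toy 3x3 kernel
    patches = im2col(x, 3, 3)                # (4, 9) patch matrix
    y = (patches @ k.ravel()).reshape(2, 2)  # one matrix product = the convolution

With multiple input channels and multiple filters, the kernel side also becomes a matrix, and the whole convolution reduces to a single large GEMM call.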
I want to understand how other deep learning frameworks like Theano, TensorFlow, and PyTorch perform convolution operations. Do they use similar libraries in the backend? There might be some articles on this topic; if someone can point me to those or explain with an answer, that would help.
PS: I posted the same question on datascience.stackexchange.com. As I didn't get a reply there, I am posting it here as well. If there is a better forum for this question, please let me know.

TensorFlow has multiple backend alternatives for its operations.
For GPU, CUDA support is used: most operations are implemented with cuDNN, some use cuBLAS, and others use plain CUDA kernels.
You can also use OpenCL instead of CUDA, but then you have to compile TensorFlow yourself.
For CPU, Intel MKL is used as the BLAS library.
I'm not familiar with PyTorch and Theano, but some commonly used backend libraries are listed below:
cuDNN, cuBLAS, and CUDA: NVIDIA GPU support; the most popular option
OpenCL: vendor-neutral GPU support; I don't know much about it
MKL: a CPU BLAS library provided by Intel
OpenBLAS: an open-source CPU BLAS library
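If you want to see which device (and hence which family of kernels) TensorFlow dispatches an op to, TF 2.x can log device placement; a minimal sketch:

    import tensorflow as tf

    # Print the device each op is placed on when it executes.
    tf.debugging.set_log_device_placement(True)

    a = tf.random.normal((1024, 1024))
    b = tf.random.normal((1024, 1024))
    c = tf.matmul(a, b)  # GPU placement implies cuBLAS; CPU implies Eigen/MKL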

Related

TensorFlow profiling for non-model computations

I have a computation which has for loops and calls to TensorFlow matrix algorithms such as tf.lstsq, and TensorFlow iteration with tf.map_fn. I would like to profile it to see how much parallelism I am getting in tf.map_fn and in the matrix algorithms that get called.
This doesn't seem to be the use case at all for the TensorFlow Profiler, which is organized around the neural-network model training loop.
Is there a way to use the TensorFlow Profiler for arbitrary TensorFlow computations, or is the go-to move in this case to use NVIDIA tools like nvprof?
I figured out that the nvprof, nvvp, and Nsight tools I was looking for are available as a Conda install of cudatoolkit-dev. Uses are described in this gist.
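For reference, in TF 2.x the Profiler can also wrap arbitrary computations rather than only a Keras training loop; a minimal sketch (the logdir path is an arbitrary choice):

    import tensorflow as tf

    tf.profiler.experimental.start("logdir")

    x = tf.random.normal((100, 10, 10))
    # Solve a small least-squares problem per batch element inside map_fn.
    y = tf.map_fn(lambda m: tf.linalg.lstsq(m, tf.ones((10, 1))), x)

    tf.profiler.experimental.stop()
    # Inspect the trace with: tensorboard --logdir logdir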

How do Ruy, XNNPACK, and Eigen work in TensorFlow Lite?

I heard from various sources (mostly from the official documents) that TensorFlow Lite (for ARM) uses these three libraries - Ruy, Eigen, XNNPACK - for its operation.
I understand they somehow accelerate the computation (mainly convolution) in TF Lite, but I'm not exactly sure what purpose each library serves. I know Eigen is a BLAS library, but I'm not sure what the others are and how they relate to each other in TF Lite.
Would someone care to explain what different purposes they serve and how they are used in conjunction in TF Lite? (Call stacks, maybe?)
I've been looking around the official documentation of each library but was unable to find much detail on Ruy and XNNPACK. Ruy says that it provides efficient matrix multiplication, but isn't that what BLAS libraries do?
Older versions of TensorFlow Lite used the Eigen and gemmlowp libraries to accelerate the computation. However, on ARM platforms the performance was worse than, e.g., the Arm Compute Library.
TensorFlow Lite replaced Eigen and gemmlowp with the Ruy matrix multiplication library around version 2.3. They serve a similar purpose, but Ruy performs better. Ruy is the default on ARM platforms, but you can still compile TensorFlow Lite without it.
XNNPACK outperforms Ruy even more, but it focuses solely on floating-point operations.
Regarding Ruy performance benchmarks, check this thread https://github.com/google/ruy/issues/195 and the benchmarks on Pixel 4 https://docs.google.com/spreadsheets/d/1CB4gsI7pujNRAf5Iz5vuD783QQqO2zOu8up9IpTKdlU/edit#gid=510573209.
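As a rough way to see these backends at work, you can run inference through the Python TFLite interpreter; a minimal sketch, assuming a converted model file model.tflite exists (hypothetical path). In recent releases the XNNPACK delegate is applied by default for float models, with Ruy backing the remaining matrix multiplications:

    import numpy as np
    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_path="model.tflite", num_threads=4)
    interpreter.allocate_tensors()

    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=np.float32))
    interpreter.invoke()
    result = interpreter.get_tensor(out["index"])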

GPU support for TensorFlow & PyTorch

Okay, so I've worked on a bunch of Deep Learning projects and internships now, and I've never had to do heavy training. But lately I've been thinking of doing some Transfer Learning, for which I'll need to run my code on a GPU. I have a system with Windows 10 and a dedicated NVIDIA GeForce 940M GPU. I've been doing a lot of research online, but I'm still a bit confused. I haven't installed the NVIDIA CUDA Toolkit or cuDNN or tensorflow-gpu on my system yet. I currently use TensorFlow and PyTorch to train my DL models. Here are my queries:
When I define a tensor in tf or PyTorch, it is a CPU tensor by default, so all the training I've been doing so far has been on the CPU. If I make sure to install the correct versions of CUDA, cuDNN, and tensorflow-gpu (specifically for TensorFlow), I can run my models on my GPU using tf-gpu and PyTorch, and that's it? (I'm aware of torch.cuda.is_available() in PyTorch to ensure PyTorch can access my GPU, and of the device_lib module in tf to check whether my GPU is visible to TensorFlow.) (I'm also aware that tf doesn't support all NVIDIA GPUs.)
Why does tf have a separate module for GPU support? PyTorch doesn't seem to have that; all you need to do is cast your tensor from cpu() to cuda() to switch between them.
Why install cuDNN? I know it is a CUDA-based library built to support training deep neural nets on the GPU. But do tf-gpu and torch use it in the backend while training on the GPU?
After tf == 1.15, did they combine CPU and GPU support into one package?
First of all, unfortunately, the 940M is a rather weak GPU for training. I suggest you use Google Colab for faster training, although the 940M would of course still be faster than the CPU. So here are my answers to your four questions.
1-) Yes, if you install the requirements correctly, you can run on the GPU. You can also manually place your data on your GPU; you can check the implementations in TensorFlow. In PyTorch, you should specify the device that you want to use: as you said, do device = torch.device("cuda" if args.cuda else "cpu"), then always call .to(device) on models and data. It will then automatically use the GPU if available (see the sketch after this list).
2-) PyTorch also needs an extra installation (module) for GPU support. However, with recent updates both TF and PyTorch make it easy to write GPU-compatible code.
3-) Both TensorFlow and PyTorch are based on cuDNN. You can use them without cuDNN, but as far as I know that hurts performance, though I'm not sure about this topic.
4-) No, they are still different packages: tensorflow-gpu==1.15 and tensorflow==1.15. What they did with TF2 was make TensorFlow more like Keras, so it is more simplified than 1.15 and before.
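A minimal sketch of the device-placement pattern from answer 1-), using torch.cuda.is_available() directly instead of an args flag:

    import torch
    import torch.nn as nn

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = nn.Linear(10, 2).to(device)   # move the parameters to the GPU if present
    x = torch.randn(32, 10).to(device)    # move the data the same way
    y = model(x)                          # runs on GPU kernels (cuBLAS here) if available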
The rest was already answered above. Regarding 3): cuDNN optimizes layer operations at the hardware level, and those implementations are pure black magic. It is incredibly hard to write CUDA code that properly utilizes your GPU (how to load data onto the GPU, how to actually perform the operations using matrices, etc.).

What is the difference between JAX, Trax, and TensorRT, in simple terms?

I have been using TensorRT and TensorFlow-TRT to accelerate the inference of my DL algorithms.
Then I have heard of:
JAX https://github.com/google/jax
Trax https://github.com/google/trax
Both seem to accelerate DL, but I am having a hard time understanding them. Can anyone explain them in simple terms?
Trax is a deep learning framework created by Google and extensively used by the Google Brain team. It comes as an alternative to TensorFlow and PyTorch for implementing off-the-shelf state-of-the-art deep learning models, for example Transformers, BERT, etc., principally in the field of Natural Language Processing.
Trax is built upon TensorFlow and JAX. JAX is an enhanced and optimised version of NumPy. The important distinction between JAX and NumPy is that the former uses a library called XLA (Accelerated Linear Algebra), which allows your NumPy-style code to run on GPU and TPU rather than on CPU as in plain NumPy, thus speeding up computation.
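A minimal JAX sketch of that point: jax.numpy mirrors the NumPy API, and jax.jit compiles the function through XLA so the same code can run on CPU, GPU, or TPU:

    import jax
    import jax.numpy as jnp

    @jax.jit
    def gram(x):
        return jnp.dot(x, x.T)

    x = jnp.ones((512, 512))
    y = gram(x)  # first call compiles via XLA; later calls reuse the compiled binary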

Is NNPACK implemented in TensorFlow?

Is NNPACK implemented in TensorFlow? I found that it can be leveraged by higher-level frameworks, such as Torch, Caffe, TensorFlow, Theano, and Mocha.jl. However, I have not found any implementation in TensorFlow.
TensorFlow does not currently use NNPACK to accelerate its math kernels; instead it uses a combination of Eigen::Tensor, cuDNN, cuBLAS, XLA-generated, and hand-written kernels.
If you wanted to use NNPACK inside a TensorFlow op, you could do so by adding a custom op kernel that uses that library.
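For illustration only: if you had compiled such an NNPACK-backed kernel into a shared library, you would load and call it from Python roughly like this (the library file, op name, and shapes below are assumptions, not an existing TensorFlow API):

    import tensorflow as tf

    # Hypothetical shared library containing a custom NNPACK-backed conv op.
    nnpack_ops = tf.load_op_library("./libnnpack_conv.so")

    inputs = tf.random.normal((1, 32, 32, 3))   # NHWC input (illustrative)
    filters = tf.random.normal((3, 3, 3, 8))    # HWIO filters (illustrative)
    y = nnpack_ops.nnpack_conv2d(inputs, filters)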