Numpy, BLAS and CUBLAS

Numpy can be "linked/compiled" against different BLAS implementations (MKL, ACML, ATLAS, GotoBlas, etc). That's not always straightforward to configure but it is possible.
Is it also possible to "link/compile" numpy against NVIDIA's CUBLAS implementation?
I couldn't find any resources on the web, and before I spend too much time trying, I wanted to make sure that it is possible at all.
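For reference, a quick way to check which BLAS a given numpy build is linked against is numpy.show_config(), which prints the build-time BLAS/LAPACK information:

```python
# Quick check of which BLAS/LAPACK implementation this numpy build is linked
# against (MKL, OpenBLAS, ATLAS, ...); the output format varies between versions.
import numpy as np

np.show_config()
```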

In a word: no, you can't do that.
There is a rather good scikit called scikits.cuda, built on top of PyCUDA, which provides access to CUBLAS from Python. PyCUDA provides a numpy.ndarray-like class which allows seamless manipulation of numpy arrays in GPU memory with CUDA. So you can use CUBLAS and CUDA with numpy, but you can't just link against CUBLAS and expect it to work.
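For illustration, a rough sketch of the scikits.cuda / PyCUDA route (module names as documented for the skcuda package; the array sizes are arbitrary):

```python
# Rough sketch: move numpy arrays to the GPU with PyCUDA and call
# CUBLAS-backed routines through scikits.cuda (skcuda).  This is not a
# drop-in replacement for numpy; arrays are explicitly copied to/from GPU memory.
import numpy as np
import pycuda.autoinit              # creates a CUDA context on import
import pycuda.gpuarray as gpuarray
import skcuda.linalg as linalg      # older releases expose this as scikits.cuda.linalg

linalg.init()

a = np.random.rand(1000, 1000).astype(np.float32)
b = np.random.rand(1000, 1000).astype(np.float32)

a_gpu = gpuarray.to_gpu(a)          # numpy -> GPU memory
b_gpu = gpuarray.to_gpu(b)
c_gpu = linalg.dot(a_gpu, b_gpu)    # GEMM executed by CUBLAS

c = c_gpu.get()                     # GPU -> numpy
```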
There is also a commercial library that provides numpy- and CUBLAS-like functionality and has a Python interface or bindings, but I will leave it to one of their shills to fill you in on that.

Here is another possibility:
http://www.cs.toronto.edu/~tijmen/gnumpy.html
This is basically a gnumpy + cudamat environment which can be used to harness a GPU. The same code can also be run without a GPU using npmat. Refer to the link above to download all these files.
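A rough sketch of what gnumpy-style code looks like (this assumes gnumpy's documented garray API and a working cudamat install; the exact calls may differ between versions):

```python
# Sketch of gnumpy usage -- assumes the garray / dot / as_numpy_array API
# described in the gnumpy documentation; cudamat is required for GPU execution.
import numpy as np
import gnumpy as gpu

a = gpu.garray(np.random.rand(512, 512))   # numpy array -> GPU-backed array
b = gpu.garray(np.random.rand(512, 512))

c = gpu.dot(a, b)                           # matrix product on the GPU via cudamat
c_host = c.as_numpy_array()                 # copy the result back to a numpy array
```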

Related

How do TensorFlow C++ operator implementations interface with cuDNN (using conv2d as an example)?

I'm trying to trace how TensorFlow actually uses cuDNN to implement different operators. I'll use TensorFlow's conv2d operator (tf.nn.conv2d) as an example.
As a reference to give me an idea of what I should be looking for, as well as how to implement a conv2d operator in cuDNN, I've been reading this blog post: Peter Goldsborough - Convolutions with cuDNN.
So based on this answer (ANSWER: Tensorflow: Where is tf.nn.conv2d Actually Executed?), Tensorflow will (roughly, I recognize there are some other branches that could be taken) call down this stack:
tensorflow/python/ops/nn_ops.py:conv2d_v2
tensorflow/python/ops/nn_ops.py:conv2d
gen_nn_ops.py:conv2d
This file is generated when TF is built (see ANSWER: looking for source code of from gen_nn_ops in tensorflow)
Then we call down into the C++ layer...
...and we end up in the Conv2DOP class (tensorflow/core/kernels/conv_ops.cc:Conv2DOP)
Now I assume (and someone please correct me if I'm wrong) that if we are correctly using TF with cuDNN, we will then end up invoking LaunchConv2DOp<GPUDevice, T>::operator().
Towards the end of this operator implementation, around where they start defining an se::dnn::BatchDescriptor (see here), and later where they run LaunchAutotunedConv (see here), I think they are making use of their higher abstraction layers, which eventually interface with the cuDNN APIs further down.
Now I expected to find some sort of communication here between, for example, se::dnn::BatchDescriptor or LaunchAutotunedConv and either the cuDNN-specific methods found in tensorflow/stream_executor/cuda/cuda_dnn.cc, or any of the auto-generated stub files that are used to wrap the cuDNN APIs based on the cuDNN version (e.g., tensorflow/stream_executor/cuda/cudnn_8_0.inc). However, I can find no link between these two levels of abstraction.
Am I missing something? At what point does TensorFlow actually make calls to the cuDNN APIs from its C++ operator implementations?
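One practical way to observe the actual cuDNN calls, independent of reading the StreamExecutor sources, is to enable cuDNN's own API logging before running the op. A minimal sketch, assuming a CUDA-enabled TensorFlow build and cuDNN's documented logging environment variables:

```python
# Sketch: turn on cuDNN API logging, then run a conv2d so the cuDNN calls
# TensorFlow makes (e.g. cudnnConvolutionForward) show up on stdout.
# Requires a GPU build of TensorFlow; the env vars are cuDNN's documented switches.
import os
os.environ["CUDNN_LOGINFO_DBG"] = "1"       # enable cuDNN informational logging
os.environ["CUDNN_LOGDEST_DBG"] = "stdout"  # send the cuDNN log to stdout

import tensorflow as tf  # import after setting the env vars

x = tf.random.normal([1, 32, 32, 3])
w = tf.random.normal([3, 3, 3, 16])
y = tf.nn.conv2d(x, w, strides=1, padding="SAME")  # triggers the cuDNN convolution path
```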

How do Ruy, XNNPACK, and Eigen work in TensorFlow Lite?

I heard from various sources (mostly from the official documents) that TensorFlow Lite (for ARM) uses these three libraries - Ruy, Eigen, XNNPACK - for its operations.
I understand they somehow accelerate the computation (mainly convolution) in TF Lite, but I'm not exactly sure what purpose each library serves. I know Eigen is a BLAS library, but I'm not sure what others are and how they are related to each other in TF Lite.
Would someone care to explain what different purposes they serve and how they are used in conjunction in TF Lite? (Call Stacks maybe?)
I've been looking through the official documentation of each library, but I was unable to find much detail on Ruy and XNNPACK. Ruy says it provides efficient matrix multiplication, but isn't that what BLAS libraries do?
Older versions of TensorFlow Lite used the Eigen and gemmlowp libraries to accelerate the computation. However, on Arm platforms the performance was worse than that of, e.g., the Arm Compute Library.
TensorFlow Lite replaced Eigen and gemmlowp with the Ruy matrix multiplication library around version 2.3. They serve a similar purpose, but Ruy performs better. Ruy is the default on Arm platforms, but you can still compile TensorFlow Lite without it.
XNNPACK outperforms Ruy even more, but it focuses solely on floating-point operations.
Regarding Ruy performance benchmarks, check this thread (https://github.com/google/ruy/issues/195) and the Pixel 4 benchmarks (https://docs.google.com/spreadsheets/d/1CB4gsI7pujNRAf5Iz5vuD783QQqO2zOu8up9IpTKdlU/edit#gid=510573209).
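As for how you hit these libraries in practice: from Python you just run the interpreter, and whether Ruy or XNNPACK ends up executing the float kernels is decided by how the TF Lite runtime was built (e.g. build defines such as tflite_with_xnnpack), not by this code. A minimal sketch, with a placeholder model path:

```python
# Minimal sketch of running a float TF Lite model from Python.  Which backend
# (Ruy or XNNPACK) executes the float kernels depends on how the runtime was
# built; "model.tflite" is a placeholder path.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite", num_threads=4)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=np.float32))
interpreter.invoke()
result = interpreter.get_tensor(out["index"])
```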

Minimal (light version) PyTorch and Numpy packages in production

I am putting a model into production and I am required to scan all dependencies (Pytorch and Numpy) beforehand via VeraCode Scan.
I noticed that the majority of the flaws are coming from test scripts and caffe2 modules in Pytorch and numpy.
Is there any way to build/install only part of these packages that I use in my application? (e.g. I won't use testing and caffe2 in the application so there's no need to have them in my PyTorch / Numpy source code)
1. PyInstaller
You could package your application using PyInstaller. This tool bundles your app together with Python and its dependencies and includes only the parts you need (simplifying a bit: in reality it's hard to trace your imports exactly, so some other stuff will be bundled as well).
Also, you might be in for some quirks and workarounds to make it work with PyTorch and numpy, as those dependencies are quite heavy (especially PyTorch).
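For illustration, a hypothetical PyInstaller spec file that excludes subpackages the app doesn't use; the module names in excludes are assumptions and should be adapted to what your application actually imports. You would build it with pyinstaller app.spec.

```python
# app.spec -- hypothetical PyInstaller spec (one-file build).  The module
# names in `excludes` are assumptions; adjust them to what your app really needs.
a = Analysis(
    ["app.py"],
    excludes=["caffe2", "torch.testing", "numpy.testing"],
)
pyz = PYZ(a.pure)
exe = EXE(
    pyz,
    a.scripts,
    a.binaries,
    a.zipfiles,
    a.datas,
    name="app",
    console=True,
)
```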
2. Use only PyTorch
numpy and PyTorch are pretty similar feature-wise (as PyTorch tries to stay compatible with numpy), hence maybe you could use only one of them, which would simplify the whole thing further.
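A small sketch of what that looks like (the shapes are arbitrary):

```python
# Sketch: torch tensors cover most of the numpy operations a model needs,
# so numpy can often be dropped from the runtime environment entirely.
import torch

x = torch.rand(64, 128)                 # instead of np.random.rand(64, 128)
w = torch.rand(128, 10)
logits = x @ w                          # same matmul operator as numpy
probs = torch.softmax(logits, dim=1)
print(probs.sum(dim=1))                 # rows sum to 1, like np.sum(..., axis=1)
```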
3. Use C++
Depending on the other parts of your app, you could write it (at least the neural network) in C++ using PyTorch's C++ frontend, which has been stable since the 1.5.0 release.
Going this route would allow you to compile PyTorch's C++ code statically (so all dependencies are linked in), which gives you a relatively small binary (around 30 MB compared to PyTorch's 1 GB+), but it requires a lot of work.
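A typical workflow, sketched here with a hypothetical tiny model and output file name, is to export the trained model to TorchScript from Python and load the resulting file from the C++ frontend with torch::jit::load, so only libtorch needs to ship to production:

```python
# Sketch: export a trained model to TorchScript.  The C++ side then loads
# "model.pt" via torch::jit::load and no Python interpreter is needed.
import torch
import torch.nn as nn

class TinyNet(nn.Module):               # hypothetical stand-in for your real model
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = TinyNet().eval()
example = torch.rand(1, 16)

traced = torch.jit.trace(model, example)   # record the graph with an example input
traced.save("model.pt")                    # hypothetical output file name
```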

Is GEMM or BLAS used in TensorFlow, Theano, PyTorch?

I know that Caffe uses GEneral Matrix to Matrix Multiplication (GEMM), which is part of the Basic Linear Algebra Subprograms (BLAS) library, for performing convolution operations: a convolution is converted into a matrix multiplication operation. I have referred to the article below: https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
I want to understand how other deep learning frameworks like Theano, TensorFlow, and PyTorch perform convolution operations. Do they use similar libraries in the backend? There might be some articles on this topic; if someone can point me to those or explain in an answer, that would help.
PS: I posted the same question on datascience.stackexchange.com. As I didn't get a reply there, I am posting it here as well. If there is a better forum to post this question please let me know.
TensorFlow has multiple backends for its operations.
For GPUs, CUDA support is used: most of the operations are implemented with cuDNN, some use cuBLAS, and others use plain CUDA kernels.
You can also use OpenCL instead of CUDA, but you have to compile TensorFlow yourself.
For CPUs, Intel MKL is used as the BLAS library.
I'm not familiar with PyTorch and Theano, but some commonly used BLAS-related libraries are listed below:
cuDNN, cuBLAS, and CUDA: NVIDIA GPU support, the most popular option
OpenCL: vendor-neutral GPU support; I don't know much about it
MKL: CPU BLAS library provided by Intel
OpenBLAS: open-source CPU BLAS library
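To make the "convolution as GEMM" idea from the question concrete, here is a toy numpy sketch of im2col followed by a single matrix product (single channel, no padding or strides; real frameworks handle batches, channels, and memory layouts):

```python
import numpy as np

def conv2d_as_gemm(x, w):
    """2-D convolution (deep-learning sense, i.e. cross-correlation) of a
    single-channel image via im2col + one matrix product."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    # im2col: unroll every kh x kw patch into one row of a big matrix
    cols = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    # the whole convolution is now a single GEMM (here a matrix-vector product)
    return (cols @ w.ravel()).reshape(oh, ow)

# tiny example
x = np.arange(16.0).reshape(4, 4)
w = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv2d_as_gemm(x, w))
```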

How much of sklearn can I use with PyPy?

The PyPy project is currently adding support for numpy.
My impression is that the sklearn library is mainly based on numpy.
Would I be able to use most of this library or there are other requirements that are not supported yet?
Officially, none of it. If you want to do a port, go ahead (and please report results on the mailing list), but PyPy is simply not supported because scikit-learn uses many, many parts of NumPy and SciPy as well as having a lot of C, C++ and Cython extension code.
From the official scikit-learn FAQ (https://scikit-learn.org/stable/faq.html):
Do you support PyPy?
In case you didn’t know, PyPy is an alternative Python implementation with a built-in just-in-time compiler. Experimental support for PyPy3-v5.10+ has been added, which requires Numpy 1.14.0+, and scipy 1.1.0+.
Also see what PyPy has to say (https://www.pypy.org/):
Compatibility: PyPy is highly compatible with existing python code. It supports cffi, cppyy, and can run popular python libraries like twisted, and django. It can also run NumPy, Scikit-learn and more via a c-extension compatibility layer.