How do Ruy, XNNPACK, and Eigen work in TensorFlow Lite?

I heard from various sources (mostly from the official documentation) that TensorFlow Lite (for ARM) uses these three libraries - Ruy, Eigen, and XNNPACK - for its operation.
I understand they somehow accelerate the computation (mainly convolution) in TF Lite, but I'm not exactly sure what purpose each library serves. I know Eigen is a BLAS library, but I'm not sure what the others are or how they relate to each other in TF Lite.
Would someone care to explain what different purposes they serve and how they are used together in TF Lite? (Call stacks, maybe?)
I've been looking through the official documentation of each library, but I was unable to find much detail for Ruy and XNNPACK. Ruy says that it provides efficient matrix multiplication, but isn't that what BLAS libraries do?

Older versions of TensorFlow Lite used the Eigen and gemmlowp libraries to accelerate computation. However, on Arm platforms the performance was worse than that of, e.g., the Arm Compute Library.
Around version 2.3, TensorFlow Lite replaced Eigen and gemmlowp with the Ruy matrix multiplication library. They serve a similar purpose, but Ruy performs better. Ruy is the default on Arm platforms, but you can still compile TensorFlow Lite without it.
XNNPACK outperforms Ruy even further, but it focuses solely on floating-point operations.
For Ruy performance benchmarks, see this thread https://github.com/google/ruy/issues/195 and the Pixel 4 benchmarks https://docs.google.com/spreadsheets/d/1CB4gsI7pujNRAf5Iz5vuD783QQqO2zOu8up9IpTKdlU/edit#gid=510573209.
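As a user you don't call Ruy or XNNPACK directly; the TFLite runtime picks the backends for you. Here is a minimal, hedged sketch of running a model through the Python interpreter ("model.tflite" is a placeholder path; in recent TensorFlow releases the XNNPACK delegate is applied automatically to supported float models, and the remaining built-in kernels use Ruy for their matrix multiplications on Arm, though exact defaults depend on how the library was built):

    import numpy as np
    import tensorflow as tf

    # "model.tflite" is a placeholder for your float model.
    # In recent builds, supported float ops go through XNNPACK automatically;
    # other ops use the built-in kernels (Ruy-backed matmuls on Arm).
    interpreter = tf.lite.Interpreter(model_path="model.tflite", num_threads=4)
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Feed random data just to exercise the kernels.
    x = np.random.rand(*input_details[0]["shape"]).astype(np.float32)
    interpreter.set_tensor(input_details[0]["index"], x)
    interpreter.invoke()
    y = interpreter.get_tensor(output_details[0]["index"])
    print(y.shape)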

Related

Minimal (light version) PyTorch and Numpy packages in production

I am putting a model into production and I am required to scan all dependencies (PyTorch and NumPy) beforehand via Veracode Scan.
I noticed that the majority of the flaws come from test scripts and caffe2 modules in PyTorch and NumPy.
Is there any way to build/install only the parts of these packages that I use in my application? (e.g. I won't use testing or caffe2 in the application, so there's no need to have them in my PyTorch / NumPy source code.)
1. PyInstaller
You could package your application using PyInstaller. This tool packages your app with Python and its dependencies and uses only the parts you need (simplifying a bit; in reality it's hard to trace your imports exactly, so some other things will be bundled as well).
Also, you might be in for some quirks and workarounds to make it work with PyTorch and NumPy, as those dependencies are quite heavy (especially PyTorch).
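As a rough sketch, PyInstaller can be driven from Python as well as from the command line; "app.py" and the bundle name below are hypothetical placeholders for your own entry point:

    # Requires `pip install pyinstaller`; "app.py" is a placeholder entry point.
    import PyInstaller.__main__

    PyInstaller.__main__.run([
        "app.py",
        "--onefile",                     # bundle everything into a single executable
        "--name", "my_model_service",    # hypothetical output name
        # Heavy packages like torch often need extra hidden-import/collect
        # options or hooks; expect some trial and error here.
    ])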
2. Use only PyTorch
NumPy and PyTorch are pretty similar feature-wise (PyTorch tries to stay compatible with NumPy), so you might be able to use only one of them, which would simplify the whole thing further.
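For instance, most NumPy-style preprocessing can be expressed directly with torch tensors, so NumPy can often be dropped from the dependency list entirely; a small illustrative sketch:

    import torch

    # Typical "NumPy-style" preprocessing done purely with torch tensors.
    x = torch.rand(8, 3)                       # instead of np.random.rand(8, 3)
    x = (x - x.mean(dim=0)) / x.std(dim=0)     # standardize columns
    mask = x[:, 0] > 0                         # boolean indexing works the same way
    selected = x[mask]
    print(selected.shape)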
3. Use C++
Depending on the other parts of your app, you could write it (at least the neural network) in C++ using PyTorch's C++ frontend, which has been stable since the 1.5.0 release.
Going this route would allow you to compile PyTorch's C++ source statically (so all dependencies are linked in), which gives you a relatively small binary (around 30 MB compared to PyTorch's 1 GB+), but it requires a lot of work.
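The usual workflow for that route is to export the trained model to TorchScript from Python and then load it from the C++ frontend with torch::jit::load. A minimal export sketch, using a hypothetical stand-in model:

    import torch
    import torch.nn as nn

    # Hypothetical model standing in for your trained network.
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
    model.eval()

    # Trace (or torch.jit.script) the model and serialize it; the resulting
    # file can be loaded in C++ with torch::jit::load("model.pt").
    example = torch.rand(1, 16)
    scripted = torch.jit.trace(model, example)
    scripted.save("model.pt")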

What is the difference between JAX, Trax, and TensorRT, in simple terms?

I have been using TensorRT and TensorFlow-TRT to accelerate the inference of my DL algorithms.
Then I have heard of:
JAX https://github.com/google/jax
Trax https://github.com/google/trax
Both seem to accelerate DL, but I am having a hard time understanding them. Can anyone explain them in simple terms?
Trax is a deep learning framework created by Google and extensively used by the Google Brain team. It is an alternative to TensorFlow and PyTorch for implementing off-the-shelf, state-of-the-art deep learning models, for example Transformers, BERT, etc., mainly in the natural language processing field.
Trax is built on top of TensorFlow and JAX. JAX is an enhanced and optimised version of NumPy. The important distinction between JAX and NumPy is that the former uses a library called XLA (Accelerated Linear Algebra), which lets you run your NumPy-style code on GPU and TPU rather than only on CPU as in plain NumPy, thus speeding up computation.
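A tiny sketch of what that looks like in practice: JAX code reads almost exactly like NumPy, but jax.jit compiles it with XLA so it can run on whatever backend is available (CPU, GPU, or TPU), and jax.grad adds automatic differentiation on top:

    import jax
    import jax.numpy as jnp

    def predict(w, b, x):
        # Plain NumPy-style code, written against jax.numpy.
        return jnp.tanh(x @ w + b)

    # jit compiles the function with XLA for the available backend;
    # grad differentiates it with respect to the first argument (w).
    fast_predict = jax.jit(predict)
    loss_grad = jax.grad(lambda w, b, x: jnp.sum(fast_predict(w, b, x)))

    w = jnp.ones((4, 3))
    b = jnp.zeros(3)
    x = jnp.ones((2, 4))
    print(fast_predict(w, b, x).shape)   # (2, 3)
    print(loss_grad(w, b, x).shape)      # gradient w.r.t. w, shape (4, 3)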

Redundancies in tf.keras.backend and tensorflow libraries

I have been working in TensorFlow for about a year now, and I am transitioning from TF 1.x to TF 2.0. I am looking for some guidance on how to use the tf.keras.backend library in TF 2.0. I understand that the transition to TF 2.0 is supposed to remove a lot of redundancies in modeling and building graphs, since there were many ways to create equivalent layers in earlier TensorFlow versions (and I'm insanely grateful for that change!), but I'm getting stuck on understanding when to use tf.keras.backend, because its operations appear redundant with other TensorFlow libraries.
I see that some of the functions in tf.keras.backend are redundant with other TensorFlow libraries. For instance, tf.keras.backend.abs and tf.math.abs are not aliases (or at least, they're not listed as aliases in the documentation), but both take the absolute value of a tensor. After examining the source code, it looks like tf.keras.backend.abs calls the tf.math.abs function, and so I really do not understand why they are not aliases. Other tf.keras.backend operations don't appear to be duplicated in TensorFlow libraries, but it looks like there are TensorFlow functions that can do equivalent things. For instance, tf.keras.backend.cast_to_floatx can be substituted with tf.dtypes.cast as long as you explicitly specify the dtype. I am wondering two things:
when is it best to use the tf.keras.backend library instead of the equivalent TensorFlow functions?
is there a difference in these functions (and other equivalent tf.keras.backend functions) that I am missing?
Short answer: Prefer TensorFlow's native API, such as tf.math.*, to the tf.keras.backend.* API wherever possible.
Longer answer:
The tf.keras.backend.* API can mostly be viewed as a remnant of the keras.backend.* API. The latter served the "exchangeable backend" design of the original (non-TF-specific) Keras. This relates to the history of Keras, which supported multiple backend libraries, among which TensorFlow used to be just one. Back in 2015 and 2016, other backends, such as Theano and MXNet, were quite popular too. But going into 2017 and 2018, TensorFlow became the dominant backend for Keras users. Eventually Keras became a part of the TensorFlow API (in 2.x and later minor versions of 1.x). In the old multi-backend world, the backend.* API provided a backend-independent abstraction over the myriad of supported backends. But in the tf.keras world, the value of the backend API is much more limited.
The various functions in tf.keras.backend.* can be divided into a few categories:
Thin wrappers around equivalent or mostly equivalent TensorFlow native APIs. Examples: tf.keras.backend.less, tf.keras.backend.sin
Slightly thicker wrappers around TensorFlow native APIs, with more features included. Examples: tf.keras.backend.batch_normalization, tf.keras.backend.conv2d (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/backend.py#L4869). They often perform preprocessing and implement other logic, which makes your life easier than calling the native TensorFlow API directly.
Unique functions that don't have an equivalent in the native TensorFlow API. Examples: tf.keras.backend.rnn, tf.keras.backend.set_learning_phase
For category 1, use the native TensorFlow APIs. For categories 2 and 3, you may want to use the tf.keras.backend.* API, as long as you can find the function on the documentation page https://www.tensorflow.org/api_docs/python/, because the documented ones have backward-compatibility guarantees, so you don't need to worry about a future version of TensorFlow removing or changing them.
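To make the category-1 point concrete, here is a small sketch showing that a thin wrapper and its native counterpart produce the same result, and how cast_to_floatx maps onto a plain tf.cast with the configured float type:

    import numpy as np
    import tensorflow as tf

    x = tf.constant([-1.5, 2.0, -3.0])

    # Category 1: thin wrapper vs. native op -- identical results.
    print(bool(tf.reduce_all(tf.keras.backend.abs(x) == tf.math.abs(x))))  # True

    # cast_to_floatx vs. an explicit tf.cast with the configured Keras float
    # type (tf.keras.backend.floatx() is "float32" by default).
    ints = np.array([1, 2, 3])
    a = tf.keras.backend.cast_to_floatx(ints)        # NumPy array, float32
    b = tf.cast(ints, tf.keras.backend.floatx())     # tf.Tensor, float32
    print(a.dtype, b.dtype)                          # both float32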

Is GEMM or BLAS used in TensorFlow, Theano, PyTorch?

I know that Caffe uses general matrix-to-matrix multiplication (GEMM), which is part of the Basic Linear Algebra Subprograms (BLAS) library, for performing convolution operations, where a convolution is converted into a matrix multiplication operation. I have referred to the article below. https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
I want to understand how other deep learning frameworks like Theano, TensorFlow, and PyTorch perform convolution operations. Do they use similar libraries in the backend? There might be some articles on this topic; if so, please point me to them, or explain in an answer.
PS: I posted the same question on datascience.stackexchange.com. As I didn't get a reply there, I am posting it here as well. If there is a better forum to post this question please let me know.
TensorFlow has multiple alternative implementations for its operations.
For GPU, CUDA support is used. Most operations are implemented with cuDNN, some use cuBLAS, and others use plain CUDA.
You can also use OpenCL instead of CUDA, but then you have to compile TensorFlow yourself.
For CPU, Intel MKL is used as the BLAS library.
I'm not familiar with PyTorch and Theano, but some commonly used acceleration and BLAS libraries are listed below:
cuDNN, cuBLAS, and CUDA: NVIDIA GPU support; the most popular option
OpenCL: vendor-neutral GPU support; I don't know much about it
MKL: CPU BLAS library provided by Intel
OpenBLAS: CPU BLAS library
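To connect this back to the GEMM point in the question, here is a minimal NumPy sketch of im2col lowering: the convolution's sliding windows are unrolled into rows of a matrix, and the whole operation becomes a single matrix multiplication that any BLAS/GEMM library can execute (as in deep learning frameworks, "convolution" here means cross-correlation, i.e. no kernel flip):

    import numpy as np

    def conv2d_as_gemm(image, kernel):
        """Lower a single-channel 2D convolution to one matrix multiplication (im2col)."""
        kh, kw = kernel.shape
        oh = image.shape[0] - kh + 1
        ow = image.shape[1] - kw + 1
        # im2col matrix: one row per output position, one column per kernel element.
        cols = np.empty((oh * ow, kh * kw))
        for i in range(oh):
            for j in range(ow):
                cols[i * ow + j] = image[i:i + kh, j:j + kw].ravel()
        # The convolution is now a GEMM: (oh*ow, kh*kw) @ (kh*kw, 1).
        return (cols @ kernel.ravel()).reshape(oh, ow)

    # Check against a direct sliding-window implementation.
    img = np.random.rand(5, 5)
    k = np.random.rand(3, 3)
    direct = np.array([[(img[i:i + 3, j:j + 3] * k).sum() for j in range(3)]
                       for i in range(3)])
    assert np.allclose(conv2d_as_gemm(img, k), direct)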

How to reduce the TensorFlow Lite binary size with only the operators needed

The TensorFlow Lite binary size is about 900 KB, which is still large for me. I want to know how to reduce the size by including only the operators needed to support my model.
Tensorflow Lite
If you are using TensorFlow Lite, the only solution I have found is to work at the level of the Interpreter and customize the kernel library (OpResolver). I don't think there is an automatic way of doing this, and the only available example (here, the header) is not so easy to understand, IMHO. I think more improvements on this topic will be included in the next releases. Also, I'm not sure this will reduce the size of the final library. In the API notes this approach is considered equivalent to selective registration, which is explained in the next part of the answer for TensorFlow Mobile.
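Whichever route you take, you first need the list of operators your model actually uses. As a hedged aside (this API is newer than the rest of this answer and ships with recent TensorFlow releases, roughly 2.7+), the TFLite analyzer can print that list for a .tflite file:

    import tensorflow as tf

    # Prints the subgraphs and the operator set used by the model, which is
    # the information you need before trimming the kernel library or doing
    # a selective build. "model.tflite" is a placeholder path.
    tf.lite.experimental.Analyzer.analyze(model_path="model.tflite")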
Tensorflow Mobile
The answer to the question "How can I enable only the ops used by my model?" is in the TensorFlow Mobile documentation (in the Binary Size subsection).
The usual size for TensorFlow Mobile seems to be 12 MB, but it is possible to reduce it by including only the ops required by the model. Obviously, this requires building the framework from source using Bazel.
You can create a header of required ops (ops_to_register.h) using the tool print_selective_registration_header.py, which is available here. The generated header should be placed in the root of the TensorFlow source directory.
You are now ready to compile the library, passing the SELECTIVE_REGISTRATION definition to the compiler (when building with Bazel, add the option --copts="-DSELECTIVE_REGISTRATION").
I think this procedure will give you a library with only the minimal set of ops inside. Some other compiler optimization flags may also help with the size (sometimes at the cost of performance).
Compile options
I don't know how you are compiling your code (static or dynamic library), what your performance needs are, or what the default options in the TensorFlow Bazel files are, but you may try the following:
Reduce the optimization level to -O1 or -Os (this sometimes helps with binary size; I think the default for TensorFlow is -O2 for the framework and -O3 for the individual kernels, though I don't know about the Lite version).
Use the flags -fdata-sections and --gc-sections: quoting the gcc documentation: "[-fdata-sections] Together with a linker garbage collection (linker --gc-sections option) these options may lead to smaller statically-linked executables (after stripping)." (It seems that at least --gc-sections is used in the linker options for the Raspberry Pi build.)
-fvisibility-inlines-hidden may impact the performance of inline functions, but it decreases the size of the export table of the shared object. This option may break the library; some explanations can be found here.
Even more dangerous is -fvisibility=hidden. Look at it here.