TensorFlow Lite on RISC-V newlib

I'd like to compile TensorFlow Lite using the riscv64-unknown-elf (newlib) cross-compiler to run it on Spike or some other RISC-V simulator. AFAIK, no such option exists:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/make/targets/riscv_makefile.inc
It seems that support exists for riscv64-unknown-linux and for riscv32-unknown-elf. Is it as simple as changing the latter toolchain prefix to riscv64-unknown-elf? I tried that and I get errors during make.
The authors of this paper, https://edge.seas.harvard.edu/files/edge/files/carrv_workshop_submission_2019_camera_ready.pdf, successfully compiled TensorFlow Lite with the newlib toolchain, but they mention that they had to make changes to the TFLite source code, and to my knowledge those changes have not been shared.

Related

Is there any way to make stable-baselines work with TensorFlow 2.0?

I am trying to run an existing program that uses stable-baselines. My laptop has an M1 chip, so it doesn't work with TensorFlow 1.x (I found tutorials on how to use TensorFlow 2.x, but not TensorFlow 1.x). And if I use stable-baselines3, I will have to change a lot of code in the program, since there are differences between PPO2 and PPO. So I wonder: is there any way I can either use TensorFlow 1.x on my Mac or use stable-baselines with TensorFlow 2.x?

How do Ruy, XNNPACK, and Eigen work in TensorFlow Lite?

I have heard from various sources (mostly the official documentation) that TensorFlow Lite for ARM uses three libraries, Ruy, Eigen, and XNNPACK, for its operation.
I understand they somehow accelerate the computation (mainly convolution) in TF Lite, but I'm not exactly sure what purpose each library serves. I know Eigen is a BLAS library, but I'm not sure what the others are and how they relate to each other in TF Lite.
Would someone care to explain what different purposes they serve and how they are used in conjunction in TF Lite? (Call stacks, maybe?)
I've been looking through the official documentation of each library, but I was unable to find much detail on Ruy and XNNPACK. Ruy says it provides efficient matrix multiplication, but isn't that what BLAS libraries do?
Older versions of TensorFlow Lite used the Eigen and gemmlowp libraries to accelerate computation. However, on Arm platforms the performance was poor compared to, for example, the Arm Compute Library.
Around version 2.3, TensorFlow Lite replaced Eigen and gemmlowp with the Ruy matrix multiplication library. They serve a similar purpose, but Ruy performs better. Ruy is the default on Arm platforms, but you can still compile TensorFlow Lite without it.
XNNPACK outperforms Ruy even further, but it focuses solely on floating-point operations.
For Ruy performance benchmarks, check this thread https://github.com/google/ruy/issues/195, and the Pixel 4 benchmarks https://docs.google.com/spreadsheets/d/1CB4gsI7pujNRAf5Iz5vuD783QQqO2zOu8up9IpTKdlU/edit#gid=510573209.
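To make the division of labour concrete, here is a minimal C++ sketch of how the XNNPACK delegate is applied on top of the default built-in kernels (the ones that use Ruy on Arm). This is only an illustration, not code from the question: the model path, thread count, and include paths are placeholders assuming a standard TensorFlow source checkout.

#include <memory>

#include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
  // Load the model and build an interpreter with the default builtin kernels
  // (on Arm these use Ruy for matrix multiplication).
  auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
  if (!model) return 1;
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
  if (!interpreter) return 1;

  // Create the XNNPACK delegate; it takes over the float ops it supports,
  // and everything else falls back to the builtin (Ruy-backed) kernels.
  TfLiteXNNPackDelegateOptions options = TfLiteXNNPackDelegateOptionsDefault();
  options.num_threads = 4;  // placeholder thread count
  TfLiteDelegate* xnnpack_delegate = TfLiteXNNPackDelegateCreate(&options);
  if (interpreter->ModifyGraphWithDelegate(xnnpack_delegate) != kTfLiteOk) return 1;

  interpreter->AllocateTensors();
  // ... fill input tensors, then:
  interpreter->Invoke();
  // ... read output tensors ...

  // The interpreter must be destroyed before the delegate it uses.
  interpreter.reset();
  TfLiteXNNPackDelegateDelete(xnnpack_delegate);
  return 0;
}

Note that in recent releases the XNNPACK delegate may already be applied by default, depending on how the library was built, in which case the explicit delegate setup above is roughly what happens under the hood.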

TFLite Select Ops in C++ inference without using FlexDelegate

I'm trying to build a C++ TFLite program that runs inference with a model which uses TF Select ops, without building the entire flex delegate library, i.e. without adding the flex delegate as a dependency in the BUILD file (I'm using Bazel here). Keeping the flex delegate in allows the program to build and run on x86_64, but cross-compilation for the Raspberry Pi fails, and furthermore, the binary is nearly an order of magnitude larger than expected. Is it possible to use ops which are not natively supported by TFLite in a TFLite C++ program without building the entire delegate library?
I think selective build is what you are looking for: https://www.tensorflow.org/lite/guide/reduce_binary_size
It links only the ops that are used in your models, which vastly reduces the library size.
Following the instructions on that page, you can produce .aar files; extracting those files, you will find the .so libraries.

Workflow for converting .pb files to .tflite

Overview
I know this subject has been discussed many times, but I am having a hard time understanding the workflow, or rather, the variations of the workflow.
For example, imagine you are installing TensorFlow on Windows 10. The main goal is to train a custom model, convert it to TensorFlow Lite, and copy the converted .tflite file to a Raspberry Pi running TensorFlow Lite.
The confusion for me starts with the conversion process. After following along with multiple guides, it seems TensorFlow is often installed with pip or Anaconda. But then I see detailed tutorials which indicate it needs to be built from source in order to convert TensorFlow models to TFLite models.
To make things more interesting, I've also seen models converted via Python scripts, as seen here.
Question
So far I have seen 3 ways to do this conversion, and it could just be that I don't have a grasp on the full picture. Below are the abbreviated methods I have seen:
Build from source, and use the TensorFlow Lite Optimizing Converter (TOCO):
bazel run --config=opt tensorflow/lite/toco:toco -- --input_file=$OUTPUT_DIR/tflite_graph.pb --output_file=$OUTPUT_DIR/detect.tflite ...
Use the TensorFlow Lite Converter Python API:
import tensorflow as tf

# export_dir is the path to the SavedModel directory to convert
converter = tf.lite.TFLiteConverter.from_saved_model(export_dir)
tflite_model = converter.convert()
with tf.io.gfile.GFile('model.tflite', 'wb') as f:
    f.write(tflite_model)
Use the tflite_convert CLI utility:
tflite_convert --saved_model_dir=/tmp/mobilenet_saved_model --output_file=/tmp/mobilenet.tflite
I *think* I understand that options 2 and 3 are the same, in the sense that the tflite_convert utility is installed and can be invoked either from the command line or through a Python script. But is there a specific reason you should choose one over the other?
And lastly, what really confuses me is option 1. And maybe it's a version thing (1.x vs 2.x)? But what is the difference between the TensorFlow Lite Optimizing Converter (TOCO) and the TensorFlow Lite Converter? It appears that in order to use TOCO you have to build TensorFlow from source, so is there a reason you would use one over the other?
There is no difference in the output from the different conversion methods, as long as the parameters remain the same. The Python API is better if you want to generate TFLite models in an automated way (for example, a Python script that is run periodically).
The TensorFlow Lite Optimizing Converter (TOCO) was the first version of the TF-to-TFLite converter. It was recently deprecated and replaced with a new converter that can handle more ops/models. So I wouldn't recommend using toco:toco via bazel; rather, use tflite_convert as mentioned here.
You should never have to build the converter from source, unless you are making some changes to it and want to test them out.

How to reduce the TensorFlow Lite binary size with only the operators needed

The TensorFlow Lite binary is about 900 KB, which is still too large for me. I want to know how to reduce the size so that it contains only the operators needed to support my model.
TensorFlow Lite
If you are using TensorFlow Lite, the only solution I have found is to work at the level of the Interpreter and customize the kernel library (the OpResolver). I don't think there is an automatic way of doing this, and the only available example (here the header) is not so easy to understand IMHO. I think more improvements on this topic will be included in the next releases. Also, I'm not sure this will reduce the size of the final library. In the API notes this approach is considered equivalent to selective registration, which is explained in the next part of this answer for TensorFlow Mobile.
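As a rough illustration of that approach, here is a minimal sketch of a hand-built OpResolver that registers only a few kernels. The three ops, the model path, and the include paths are placeholders (they assume a standard TensorFlow source checkout), not something taken from the question; you would list exactly the ops your model uses.

#include <memory>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/builtin_op_kernels.h"
#include "tensorflow/lite/model.h"
#include "tensorflow/lite/mutable_op_resolver.h"

int main() {
  auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
  if (!model) return 1;

  // Register only the builtin kernels this particular model needs,
  // instead of pulling in the full BuiltinOpResolver.
  tflite::MutableOpResolver resolver;
  resolver.AddBuiltin(tflite::BuiltinOperator_CONV_2D,
                      tflite::ops::builtin::Register_CONV_2D());
  resolver.AddBuiltin(tflite::BuiltinOperator_DEPTHWISE_CONV_2D,
                      tflite::ops::builtin::Register_DEPTHWISE_CONV_2D());
  resolver.AddBuiltin(tflite::BuiltinOperator_SOFTMAX,
                      tflite::ops::builtin::Register_SOFTMAX());

  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
  if (!interpreter) return 1;  // fails if the model needs an op that was not registered

  interpreter->AllocateTensors();
  // ... fill inputs, Invoke(), read outputs ...
  return 0;
}

Whether this actually shrinks the final binary still depends on the linker discarding the unreferenced kernels (see -fdata-sections and --gc-sections under Compile options below).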
TensorFlow Mobile
The answer to the question "How can I enable only the ops used by my model?" is in the TensorFlow Mobile documentation (in the Binary Size subsection).
The usual size for TensorFlow Mobile seems to be 12 MB, but it is possible to reduce it by including only the ops required by the model. Obviously, this requires building the framework with Bazel.
You can create a header of the required ops (ops_to_register.h) using the tool print_selective_registration_header.py, which is available here. The generated header should be placed in the root of the TensorFlow source directory.
You are then ready to compile the library, passing the SELECTIVE_REGISTRATION definition to the compiler (when building with Bazel, add the option --copts="-DSELECTIVE_REGISTRATION").
I think this procedure will give you a library with a minimal set of ops inside. Some other compiler optimization flags may help with the size (sometimes at the cost of performance).
Compile options
I actually don't know how you are compiling your code (static or dynamic library), what your needs are in terms of performance, or what the default options are in the TensorFlow Bazel files, but you may try the following:
Reduce the optimization level to -O1 or -Os (this sometimes helps with binary size; I think the default for TensorFlow is -O2 for the framework and -O3 for the individual kernels, though I don't know about the Lite version).
Use the flags -fdata-sections and --gc-sections; quoting the GCC documentation: "[-fdata-sections] Together with a linker garbage collection (linker --gc-sections option) these options may lead to smaller statically-linked executables (after stripping)." (It seems that at least --gc-sections is already used in the linker options for the Raspberry Pi.)
-fvisibility-inlines-hidden may affect the performance of inline functions, but it decreases the size of the export table of the shared object. This option may break the library. Some explanations can be read here.
Even more dangerous is -fvisibility=hidden. Look at it here.