I want to figure out what the thing TensorRT calls an "engine" actually is. I want to know this because I am not sure whether I will be able to use the same engine for inference on GPUs with different real architectures.
I know the engine contains some sort of code that executes the neural network inference step. I want to figure out whether it contains CUDA PTX code (a sort of bytecode compiled just-in-time by the CUDA driver) or whether it is an actual binary compiled for a specific GPU architecture.
I expect it to be a sort of portable bytecode.
Do you have any clue?
Thanks a lot!
I want to know this because I am not sure whether I will be able to use the same engine for inference on GPUs with different real architectures
TensorRT engines are optimized for the GPU architecture they are built on, so an engine built on one GPU architecture should not be used on a different architecture.
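Concretely, the usual workflow is to keep a portable model format (e.g. ONNX) and rebuild a serialized engine on each target GPU. A minimal sketch with the TensorRT Python API (assuming a recent TensorRT release and a model.onnx file; the names are placeholders):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, engine_path):
    """Build and serialize an engine for the GPU this process is running on."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

    # The builder times kernels on the current GPU, so the resulting plan
    # is specific to this architecture and should be rebuilt elsewhere.
    serialized = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(serialized)

build_engine("model.onnx", "model.trt")
```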
Related
I realize this is not the intended usage of TensorRT, but I am a bit stuck, so maybe there are some ideas out there. I have been provided with some neural network models as TensorRT serialized engines, so-called .trt files. These are basically models compiled and optimized from PyTorch to run on a specific GPU.
Now, this works fine since I do have a compatible GPU for development. However, when setting up CI/CD I am running into trouble, because the cloud servers it will run on (for testing purposes only) do not have GPUs compatible with this CUDA-compiled "engine".
So, I would like to force these models to run on the CPU, or otherwise find some other way to make them run. On the CPU would be just fine, because I only need to run a handful of inferences to check the output; it is fine if it's slow. Again, I know this is not the intended usage of TensorRT, but I need some output from the models for integration testing.
Alternative approach
The other idea I had was maybe to convert the .trt files back to .onnx or another format that I could load into another runtime engine, or just into PyTorch or TensorFlow, but I cannot find any TensorRT tools that load an engine and write a model file. Presumably because it is "compiled" and no longer convertible; yet, the model parameters must be in there, so does anyone know how to do such a thing?
I am trying to understand how the internal flow goes in MXNet when we call forward. Is there any way to get the source code of MXNet?
This really depends on what your symbolic graph looks like. I assume you use MXNet with Python (Python documentation). There you can choose to use the MXNet symbol library or the Gluon library.
Now, you were asking whether one can inspect the code, and, yes, you can find it on GitHub. The folder python contains the Python interface and src contains all MXNet sources. What happens on forward is eventually defined by the MXNet execution engine, which tracks the input/output dependencies of operators and neural network layers and allocates memory on the different devices (CPU, GPUs). There is a general architecture documentation for this.
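To make the forward path a bit more tangible, here is a tiny example with the symbol API (just a sketch, assuming MXNet 1.x; the shapes and variable names are made up):

```python
import mxnet as mx

# A tiny symbolic graph: y = relu(x . w)
x = mx.sym.Variable('x')
w = mx.sym.Variable('w')
y = mx.sym.relu(mx.sym.dot(x, w))

# simple_bind asks the engine to allocate all arrays on the chosen device;
# forward() pushes the operators to the execution engine, which runs each
# one as soon as its input dependencies are ready.
ex = y.simple_bind(ctx=mx.cpu(), x=(2, 4), w=(4, 3))
out = ex.forward(x=mx.nd.ones((2, 4)), w=mx.nd.ones((4, 3)))
print(out[0].asnumpy())
```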
I suppose you are interested in what each and every operation does, such as argmax (a reduction), tanh (a unary math operation) or convolution (a complex neural network operation). You can find these in the operator folder of MXNet. This would require a whole documentation of its own, and there is a dedicated forum for MXNet specifics here, but I will give a short orientation:
Each operation in a (symbolic) execution graph needs a defined forward and backward operation. It also needs to define its output shape, so that it can be chained with other operations. If an operator needs weights, it must declare how much memory it requires, so that MXNet can allocate it (see the sketch after this list).
Each operation requires several implementations: a) CPU, b) GPU (CUDA), c) a wrapper around cuDNN.
All unary math operations follow the same pattern, so they are all defined in a similar way in mshadow_op.h (e.g. relu).
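To illustrate the first point, here is a minimal Python-level custom operator (a sketch using the mx.operator.CustomOp API; the built-in C++ operators follow the same forward/backward/output-shape contract, just in C++):

```python
import mxnet as mx

class MyRelu(mx.operator.CustomOp):
    """A toy ReLU showing the forward/backward pair every operator defines."""

    def forward(self, is_train, req, in_data, out_data, aux):
        self.assign(out_data[0], req[0], mx.nd.maximum(in_data[0], 0))

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        self.assign(in_grad[0], req[0], out_grad[0] * (in_data[0] > 0))

@mx.operator.register("my_relu")
class MyReluProp(mx.operator.CustomOpProp):
    def list_arguments(self):
        return ['data']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        # Output shape equals input shape, so the op can be chained.
        return [in_shape[0]], [in_shape[0]], []

    def create_operator(self, ctx, shapes, dtypes):
        return MyRelu()

# Usage in a symbolic graph: mx.sym.Custom(data=some_symbol, op_type='my_relu')
```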
This is all I can tell you based on your quite broad question.
The TensorFlow Lite binary size is about 900 KB, which is still too large for me. How can I reduce the size so that it includes only the operators needed to support my model?
Tensorflow Lite
If you are using TensorFlow Lite, the only solution I have found is to work at the level of the Interpreter and customize the kernel library (OpResolver). I don't think there is an automatic way of doing this, and the only available example (here the header) is not so easy to understand, IMHO. I think more improvements on this topic will be included in the next releases. Also, I'm not sure this will reduce the size of the final library. In the API notes this approach is considered equivalent to selective registration, which is explained in the next part of the answer for TensorFlow Mobile.
Tensorflow Mobile
The answer to "How can I enable only the ops used by my model?" is in the TensorFlow Mobile documentation (in the Binary Size subsection).
The usual size for TensorFlow Mobile seems to be 12 MB, but it is possible to reduce it by including only the ops required by the model. Of course, this requires building TensorFlow from source with Bazel.
You can create a header of required ops (ops_to_register.h) using the tool print_selective_registration_header.py, which is available here. The generated header should be placed in the root of the TensorFlow source directory.
You are now ready to compile the library, passing the SELECTIVE_REGISTRATION definition to the compiler (when building with Bazel, add the option --copts="-DSELECTIVE_REGISTRATION").
I think this procedure will produce a library with only the minimal set of ops inside. Some other compiler optimization flags may also help with the size (sometimes at the expense of performance).
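If you want to see which ops your model actually uses (i.e. what the generated header will keep) before running the tool, here is a quick sketch, assuming a frozen GraphDef file and the TF 1.x Python API (the file name is just a placeholder):

```python
import tensorflow as tf  # TF 1.x; on TF 2.x use tf.compat.v1.GraphDef

graph_def = tf.GraphDef()
with open("frozen_model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

# The distinct op types in the graph are what selective registration keeps.
print(sorted({node.op for node in graph_def.node}))
```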
Compile options
I actually don't know how you are compiling your code (static lib or dynamic lib), what your performance requirements are, or what the default options in the TensorFlow Bazel files are, but you may try:
reducing the optimization level to -O1 or -Os (this sometimes helps with binary size; I think the default for TensorFlow is -O2 for the framework and -O3 for the individual kernels, though I don't know about the Lite version);
using the flags -fdata-sections and --gc-sections: quoting the GCC documentation: "[-fdata-sections] Together with a linker garbage collection (linker --gc-sections option) these options may lead to smaller statically-linked executables (after stripping)." (It seems that at least --gc-sections is used in the linker options for the Raspberry Pi);
-fvisibility-inlines-hidden should impact the performance of inline functions, but it decreases the size of the shared object's export table. This option may break the library. Some explanations can be read here.
Even more dangerous is -fvisibility=hidden. Look at it here.
I am currently using a Lenovo IdeaPad PC with AMD Radeon graphics. I am trying to train an image classifier model using convolutional neural networks. The dataset contains 50,000 images and it takes too long to train the model. Can someone tell me how I can use my AMD GPU to speed up the process? I think AMD graphics cards do not support CUDA, so is there any way around this?
PS: I am using Ubuntu 17.10
What you're asking for is OpenCL support, or in more grandiose terms: the democratization of accelerated devices. There seems to be tentative support for OpenCL; I see some people testing it as of early 2018, but it doesn't appear fully baked yet. The issue has been tracked for quite some time here:
https://github.com/tensorflow/tensorflow/issues/22
You should also be aware of development on XLA, an attempt to virtualize TensorFlow over an LLVM (or LLVM-like) virtualization layer, making it more portable. It is cited as being in alpha as of early 2018.
https://www.tensorflow.org/performance/xla/
There isn't yet a simple solution, but these are the two efforts to follow along these lines.
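In the meantime, you can at least confirm which devices the stock (CUDA-based) TensorFlow build actually sees on your machine; without OpenCL/ROCm support the Radeon GPU simply will not show up. A small sketch:

```python
from tensorflow.python.client import device_lib

# Lists the CPU and any CUDA GPUs; an AMD GPU will not appear
# in a standard CUDA-based TensorFlow build.
for device in device_lib.list_local_devices():
    print(device.device_type, device.name)
```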
I want to do research on deep learning, but I don't know which framework I should choose between TensorFlow and PaddlePaddle. Can anyone compare the two frameworks? Which one is better, especially in terms of CPU running efficiency?
It really depends on what you are shooting for...
If you plan on training, CPU is not going to work well for you. Use colab or kaggle.
Assuming you do get a GPU, it depends if you want to focus on classification or object detection.
If you focus on classification, Keras is probably the easiest to work with, or PyTorch if you want some advanced stuff and the ability to change things.
If you plan on object detection, things get more complicated... Inference is reasonably easy, but training is hard. There are actually four platforms you should consider:
TensorFlow - powerful but very difficult to work with. If you do not use Keras (and for OD you usually can't), you need to preprocess the dataset into TFRecords, and it is a pain. The OD API has very cryptic error messages and is very sensitive to the combination of TF version and API version. On the other hand, cool models like EfficientDet are more or less easy to use.
MMDetection - a very powerful framework that has lots of advanced models, and once you understand how to work with it, you can easily use any of the models it supports. The downside is that some models are slow to arrive (EfficientDet, for example).
PaddlePaddle - if you know Chinese, this should work OK, maybe. The documentation is a bit behind and usually requires a lot of improvisation. Basically it is similar to MMDetection, just with a few unique models and a few missing models.
Detectron2 - I haven't worked with this one, but it seems to support only a few models.
You probably first need to define for yourself what you want to do, and then choose.
Good luck!
It is not that trivial. Some models run faster with one framework, others with another. Furthermore, it depends on the hardware as well. See this blog. If inference is your only concern, then you can develop your model in any of the popular frameworks, like TensorFlow or PyTorch. In the end, convert your model to the ONNX format and benchmark its performance with DNN-Bench to choose the best inference engine for your application.
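For example, exporting a PyTorch model to ONNX and running it through ONNX Runtime looks roughly like this (a sketch; the model, file name and shapes are placeholders):

```python
import numpy as np
import torch
import onnxruntime as ort

model = torch.nn.Linear(16, 4)   # placeholder model
dummy = torch.randn(1, 16)

# Export to the framework-neutral ONNX format
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"])

# Run / benchmark with any ONNX-compatible engine; here, ONNX Runtime on CPU
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": np.random.randn(1, 16).astype(np.float32)})
print(outputs[0].shape)
```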