MxNet for Embedded device inference - mxnet

Is there a way to take an MXNet model and deploy it to an embedded device directly? By "embedded", I mean the objective is a super lightweight runtime, optionally optimized for ARM/NEON.

Yes, here is a tutorial that explains step by step how to run MXNet inference on an ARM device (a Raspberry Pi in that case) -


What are the main differences between TensorFlow Lite, TensorFlow-TRT and TensorRT?

I am using the Coral devboard and the Nvidia Jetson TX2, which is how I got to know about TensorFlow Lite, TensorFlow-TRT and TensorRT.
I have some questions about them:
Between TensorFlow-TRT and TensorRT:
When using a fully optimised/compatible graph with TensorRT, which one is faster and why?
The pipeline to use TFLite on a Google Coral (when using TensorFlow 1.x) is:
a. Use a model available in TensorFlow's zoo
b. Convert the model to frozen (.pb)
c. Use protobuf to serialize the graph
d. Convert to Tflite
e. Apply quantization (INT8)
f. Compile
What would be the pipeline when using TensorFlow-TRT and TensorRT?
Is there somewhere I can find good documentation about it?
So far I think TensorRT is closer to TensorFlow Lite because:
TFLite: after compilation you end up with a .quant.edtpu.tflite file which can be used to run inference on the devboard.
TensorRT: you end up with a .plan file to run inference on the devboard.
Thank you for the answers, and if you can point me to documentation which compares them, that will be appreciated.
TensorRT is a very fast CUDA runtime for GPU only. I am using an Nvidia Jetson Xavier NX with TensorFlow models converted to TensorRT, running on the TensorFlow-TRT (TF-TRT) runtime. The benefit of the TF-TRT runtime is that any operations unsupported by TensorRT fall back to TensorFlow.
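The fallback behaviour described above can be pictured as splitting the graph into runs of supported and unsupported operations. The following is only a toy sketch of that idea in plain Python, not how TF-TRT is actually implemented; the `SUPPORTED` set, the op names `CustomOpA`/`CustomOpB`, and the `partition` helper are all hypothetical and chosen for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical set of ops the accelerated backend can handle (illustration only).
SUPPORTED: Set[str] = {"Conv2D", "BiasAdd", "Relu", "MaxPool"}

def partition(ops: List[str], supported: Set[str]) -> List[Tuple[str, List[str]]]:
    """Group consecutive ops into ('trt', [...]) or ('tf', [...]) segments."""
    segments: List[Tuple[str, List[str]]] = []
    for op in ops:
        backend = "trt" if op in supported else "tf"
        if segments and segments[-1][0] == backend:
            # Same backend as the previous op: extend the current segment.
            segments[-1][1].append(op)
        else:
            # Backend changed: start a new segment.
            segments.append((backend, [op]))
    return segments

# A linear toy "graph"; the CustomOp* names stand in for unsupported operations.
graph = ["Conv2D", "BiasAdd", "Relu", "CustomOpA", "Conv2D", "Relu", "MaxPool", "CustomOpB"]
result = partition(graph, SUPPORTED)
for backend, group in result:
    print(backend, group)
```

In this picture each "trt" segment would become one optimized engine, while the "tf" segments keep running in TensorFlow, so the whole model still executes even when some operations are unsupported.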
I have not tried TensorFlow Lite, but I understand it as a reduced TensorFlow for inference only on "small devices". It can support GPU, but only for a limited set of operations, and I think there are no Python bindings (currently).
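For readers unfamiliar with the INT8 quantization step mentioned in the question's pipeline, here is a minimal sketch of affine (scale and zero point) quantization in plain Python. It only illustrates the arithmetic; it is not the TFLite converter, and the helper names `quant_params`, `quantize` and `dequantize` are made up for this example.

```python
from typing import List, Tuple

def quant_params(lo: float, hi: float) -> Tuple[float, int]:
    """Scale and zero point mapping [lo, hi] onto the int8 range [-128, 127]."""
    # The range is widened to contain 0 so that 0.0 is represented exactly.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / 255.0
    if scale == 0.0:
        scale = 1.0  # degenerate case: all values are zero
    zero_point = round(-128 - lo / scale)
    return scale, zero_point

def quantize(xs: List[float], scale: float, zero_point: int) -> List[int]:
    """Map floats to int8 values, clamping to the representable range."""
    return [max(-128, min(127, round(x / scale) + zero_point)) for x in xs]

def dequantize(qs: List[int], scale: float, zero_point: int) -> List[float]:
    """Map int8 values back to approximate floats."""
    return [(q - zero_point) * scale for q in qs]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zero_point = quant_params(min(weights), max(weights))
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
for w, qi, r in zip(weights, q, restored):
    print(f"{w:+.4f} -> {qi:4d} -> {r:+.4f}")
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of roughly one quantization step at most.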

How to optimize your tensorflow model by using TensorRT?

These are the instructions for the assignment:
Convert your TensorFlow model to UFF.
Use TensorRT's C++ API to parse your model and convert it to a CUDA engine.
The TensorRT engine will automatically optimize your model and perform steps like fusing layers, converting the weights to FP16 (or INT8 if you prefer), optimizing to run on Tensor Cores, and so on.
Can anyone tell me how to proceed with this assignment? I don't have a GPU in my laptop. Is it possible to do this in Google Colab or with an AWS free account?
And what packages do I have to install to run TensorRT on my laptop or in Google Colab?
I haven't used .uff, only .onnx, but from what I've seen the process is similar.
According to the documentation, with TensorFlow you can do something like:
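    from tensorflow.python.compiler.tensorrt import trt_convert as trt
    # frozen_graph is assumed to be a frozen GraphDef loaded beforehand
    converter = trt.TrtGraphConverter(
        input_graph_def=frozen_graph,
        nodes_blacklist=['logits', 'classes'])  # nodes the converter must not touch
    frozen_graph = converter.convert()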
In TensorFlow 1.x this is pretty straightforward. TrtGraphConverter has an option to serialize for FP16, like:
    converter = trt.TrtGraphConverter(
        input_graph_def=frozen_graph,  # assumed: the same frozen GraphDef as above
        precision_mode='FP16')
See the precision_mode part. Once you have serialized, you can load the networks easily in TensorRT; some good examples using C++ are here.
Unfortunately, you'll need an NVIDIA GPU with FP16 support; check this support matrix.
If I'm correct, Google Colab offers a Tesla K80 GPU, which does not have FP16 support. I'm not sure about AWS, but I'm certain the free tier does not have GPUs.
Your cheapest option could be buying a Jetson Nano, which costs around $90; it's a very powerful board and I'm sure you'll use it in the future. Or you could rent an AWS GPU server, but that is a bit expensive and the setup process is a pain.
Best of luck!
Export and convert your TensorFlow model into an .onnx file.
Then, use this onnx-tensorrt tool to convert it to a CUDA engine file.
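To get a feel for what converting weights to FP16 means numerically, the rounding can be reproduced without a GPU. This is a small sketch using only Python's standard `struct` module (format code `e` is IEEE 754 half precision); it shows the precision loss only, not TensorRT itself, and the sample weights are arbitrary.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Arbitrary sample values chosen for illustration.
weights = [0.1, 1.0, 3.14159265, 1024.5, 2049.0]
halved = [to_fp16(w) for w in weights]
for w, h in zip(weights, halved):
    print(f"{w!r:>12} -> {h!r:<18} abs error {abs(w - h):.3e}")
```

Half precision keeps about three significant decimal digits, and values beyond roughly 65504 cannot be represented at all (`struct.pack` raises `OverflowError` for them), which is one reason a converted model should be checked for accuracy.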

OpenVINO - Toolkit with YoloV4

I am currently working with the YoloV3-tiny.
To import the network into a C++ project I use the OpenVINO Toolkit. In more detail, I use the following procedure to convert the network:
Converting YOLO* Models to the Intermediate Representation (IR)
This procedure carries out a conversion and an optimization to proceed with the inference.
Now, I would like to try YoloV4 because it seems to be more effective for the purpose of the project. The problem is that the OpenVINO Toolkit does not yet support this version: it provides the .json file (needed for optimization) only up to version 3, not for version 4.
What has changed in terms of structure between version 3 and version 4 of YOLO?
Can I expect the conversion of YoloV4 to be the same as for YoloV3-tiny (or YoloV3)?
Is YoloV4 much slower than YoloV3-tiny when using only the CPU for inference?
When will YoloV4-tiny be available?
Does anyone have information about it?
The main difference between YoloV4 and YoloV3 is the backbone. YoloV4 has a CSPDarknet53 backbone, whilst YoloV3 has a Darknet53 backbone.
Also, YoloV4 is not officially supported by OpenVINO. However, you can still test and validate YoloV4 on your end with a workaround. For now, one way is to run YoloV4 through OpenCV, which builds the network using the nGraph API and then passes it to the Inference Engine. See
The key problem is the Mish activation function: there is no optimized implementation yet, which is why we have to implement it by definition with tanh and exponential functions. Unfortunately, a one-to-one topology comparison shows significant performance degradation. The performance results are also available in the GitHub link above.
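For reference, Mish is defined as x * tanh(softplus(x)) with softplus(x) = ln(1 + e^x). A minimal sketch of implementing it "by definition" in plain Python follows; the function names are illustrative, and this is not the OpenCV or Inference Engine code. The second variant uses the identity tanh(ln(1 + e^x)) = n / (n + 2) with n = e^(2x) + 2e^x, which needs only exponentials but overflows for large x.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x: float) -> float:
    """Mish activation by definition: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))

def mish_exp(x: float) -> float:
    """Same function using only exponentials (moderate |x| only, exp can overflow)."""
    n = math.exp(2.0 * x) + 2.0 * math.exp(x)
    return x * n / (n + 2.0)

for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"mish({v:+.1f}) = {mish(v):+.6f}   exp form = {mish_exp(v):+.6f}")
```

Either form costs several transcendental evaluations per element, compared with a single comparison for ReLU, which is consistent with the slowdown described above when no fused kernel exists.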
This is my project, based on v3's converter (darknet -> tensorflow -> IR); I have finished adapting it to OpenVINO for Yolov4, v4-relu and v4-tiny.
You could give it a try. You can also use V4's IR model and run it on v3's C++ demo directly.

Run Faster-rcnn on mobile iOS

I have a Faster R-CNN model that I trained and that works on my Google Cloud instance with a GPU (trained with the Google models API).
I want to run it on mobile. I found some GitHub repositories that show how to run SSD MobileNet, but I could not find one that runs Faster R-CNN.
Real time is not my concern for now.
I have an iPhone 6 with iOS 11.4.
The model can be run with Metal, Core ML, tensorflow-lite...
But for a POC I need it to run on mobile without training a new network.
Any help?
Faster R-CNN requires a number of custom layers that are not available in Metal, CoreML, etc. You will have to implement these custom layers yourself (or hire someone to implement them for you, wink wink).
I'm not sure if TF-lite will work. It only supports a limited number of operations on iOS, so chances are it won't have everything that Faster R-CNN needs. But that would be the first thing to try. If that doesn't work, I would try a Core ML model with custom layers.
See here for information about custom layers in Core ML:
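To give an idea of what such a custom layer has to compute, here is a plain-Python sketch of greedy non-maximum suppression (NMS), one of the box post-processing steps detectors like Faster R-CNN rely on. The answer above does not name specific layers, so treating NMS as one of them is an assumption, and the `iou`/`nms` helpers and sample boxes are purely illustrative.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes: Sequence[Box], scores: Sequence[float], iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two heavily overlapping boxes and one separate box (made-up coordinates).
boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (50.0, 50.0, 60.0, 60.0)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

A real custom layer would implement the same logic in Swift or Metal on tensors rather than Python lists, but the control flow is the part that standard convolution-oriented runtimes do not provide out of the box.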

Does Gensim library support GPU acceleration?

The Word2vec and Doc2vec methods provided by Gensim have a distributed version which uses BLAS, ATLAS, etc. to speed things up (details here). However, does it support a GPU mode? Is it possible to get a GPU working with Gensim?
Thank you for your question. Using GPU is on the Gensim roadmap. Will appreciate any input that you have about it.
There is a version of word2vec running on Keras by #niitsuma called word2veckeras.
The code that runs on the latest Keras version is in this fork and branch
#SimonPavlik has run a performance test on this code. He found that a single GPU is slower than multiple CPUs for word2vec.